We were looking at the file "Mayo_Differential_Expression_(diagnosis).tsv" (syn27024969) in the harmonization study and hope to obtain the count or normalized count file that was used to generate the differential expression file. There is a file named "Mayo_Residualized_counts_(diagnosis).tsv" (syn27024966), but it seems to contain individuals who do not have AD. According to the description of the Mayo RNAseq study: "These 278 subjects have the following pathological diagnoses: Alzheimer's disease (AD), N=86, progressive supranuclear palsy (PSP), N=84, pathologic aging (PA), N=28, and control (CON), N=80." The count file has a total of 259 individuals for TCX and 246 individuals for CBE, both much higher than the number (166) of control and AD individuals.

Created by gyang24
Thanks for clarifying. @jgockley @jaclynbeck
Correct. In the SageseqR pipeline, the DE contrasts can be restricted to specific diagnosis factor levels in order to cut down the processing time required for pairwise comparisons. This keeps the beta estimates of other covariates such as sex and age of death, which are more precise when more individuals are included, but only compares DE between the specified factor levels in the specified tissue, as noted in the comparison column. Feel free to refer to: https://github.com/Sage-Bionetworks/sageseqr/blob/99c4118de8bc42de32ee86fc2c31e17ccb66b3bc/vignettes/customize-config.Rmd#L102C1-L125C70

```
de contrasts: Required.
  primary: Required. Variable(s) in the metadata to define comparisons between groups. Currently must be either one numeric variable, or one or more categorical variables.
  is_numeric_int: Optional. Specifies if there is a numeric interaction variable specified. default (FALSE)
  numeric: Optional. The numeric in variable which interacts with the primary variable(s). default (NULL)
  contrasts: Optional. A list specifying contrasts of the primary variable(s) to consider for differential sequencing results if using factor(s) as your primary variable. If not specified all combinations will be tested. If specified this will speed up the pipeline. Specify the contrast with the factor values involved in the contrast separated by a hyphen. (eg for diagnosis, `contrasts: ["AD-CT"]` where AD is the value in diagnosis column for cases and CT is the value for controls. For multi-level contrasts, eg. `primary: ["diagnosis", "Sex"]` would have contrasts specified as; `contrasts: ["ZZ_F-CT_F", "ZZ_M-CT_M"]` to look at cases vs controls in females and cases vs controls in males independently. While the order before or after the hyphen doesn't matter, the order of values before/after the underscore does matter. The value order must be the same as the `primary:` specification. eg. `primary: ["diagnosis","sex"]` must be CT_M while `primary: ["sex","diagnosis"]` must be M_CT.
```

Unfortunately, the actual configuration YAML for this run of the data is not accessible at the moment.
The DE file should be reliable since "diagnosis" was included in the model and you can extract differences for specific pairs of diagnoses in post-hoc analysis. Looking at the file itself, it looks like all comparisons are of the format "AD\_[tissue] - CT\_[tissue]", which only considers the gene expression difference between control and AD cases, ignoring the other diagnoses. I _believe_ the only effect the other diagnoses (PSP, PA) have on this file is in the adjusted p-value, where when adjusting for multiple comparisons the number of possible comparisons across all diagnoses is used instead of just the number of comparisons between AD and CT. So the adjusted p-value is likely to be more conservative than if only AD and CT samples were used in the differential expression. I hope that helps, Jaclyn
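To illustrate the point above about fitting one model on all diagnoses and then reading off a single contrast, here is a minimal base R sketch on simulated data. It does not use sageseqr or the Mayo data; the variable names, effect sizes, and sample size are all made up for illustration, and the real pipeline may differ in its details.

```r
# Simulated example only: names, effect sizes, and n are hypothetical.
set.seed(2)
n <- 200
dat <- data.frame(
  diagnosis = factor(sample(c("CT", "AD", "PSP", "PA"), n, replace = TRUE),
                     levels = c("CT", "AD", "PSP", "PA")),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
# One gene's expression: AD and PSP shift the mean, and sex is a covariate.
dat$expr <- 5 + 1.0 * (dat$diagnosis == "AD") + 2.0 * (dat$diagnosis == "PSP") +
  0.8 * (dat$sex == "M") + rnorm(n, sd = 0.5)

# Model fitted on all diagnoses; CT is the reference level, so the
# "diagnosisAD" coefficient is the AD versus CT difference.
fit_all <- lm(expr ~ diagnosis + sex, data = dat)
est_all <- coef(summary(fit_all))["diagnosisAD", c("Estimate", "Std. Error")]

# Same model fitted on AD and CT samples only, for comparison.
sub <- droplevels(dat[dat$diagnosis %in% c("AD", "CT"), ])
fit_sub <- lm(expr ~ diagnosis + sex, data = sub)
est_sub <- coef(summary(fit_sub))["diagnosisAD", c("Estimate", "Std. Error")]

print(rbind(all_diagnoses = est_all, ad_ct_only = est_sub))
print(c(df_all = df.residual(fit_all), df_sub = df.residual(fit_sub)))
```

In this sketch both fits estimate the same AD versus CT difference. The full model simply has more samples available for the sex coefficient and for the residual variance, which is the motivation described in the thread for keeping PSP and PA samples in the model.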
The total number in the file seems to have increased from their original description of 278, but there are patients with PSP and PA in there.
Thanks for the response. "So the file will contain all samples from all diagnoses", does that mean the data used to generate the DE file included PSP and PA as AD? If that's the case, does that mean the result in the DE file is not reliable?
Thanks @jaclynbeck !
I no longer have access to the data used to generate those outputs that are specified in the provenance. My guess is that ages over 90 (PHI-protected and not released) were used in the normalization process. The most likely reason is a data update since the harmonization was created. Also, we would have passed all data through the harmonization and left disease signal in the data for the general reprocessing, i.e., it is not applicable for eQTL analysis without additional normalization. Please refer to the Mayo RNA-Seq data (syn20827192); it looks like there are 370 individuals with an RNA-Seq sample:

```
# Log in to Synapse (uses cached credentials if available)
synapser::synLogin()
# Download the metadata file and read it in
foo <- read.csv(synapser::synGet('syn20827192')$path)
# Count the distinct individual IDs
length(table(foo$individualID))
# 370
```
Hello! For all files named in the format "[study]\_Residualized\_counts([something]).tsv", the field in parentheses indicates that all covariates were regressed out of the data _except_ that one. So for the file in question, covariates like age, RIN, and sex have all been regressed out via a linear model, while the effects of "diagnosis" have been left in the residualized counts. So the file will contain all samples from all diagnoses. I hope that helps, let me know if you have any more questions. Jaclyn
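As a rough illustration of the residualization described above, the following base R sketch regresses covariates out of one simulated gene while leaving the diagnosis effect in place. This is only an assumption about the general technique, not the harmonization pipeline's actual code; the data, covariate names, and effect sizes are hypothetical.

```r
# Simulated example only: all names and values are hypothetical.
set.seed(1)
n <- 120
meta <- data.frame(
  diagnosis = factor(sample(c("CT", "AD", "PSP", "PA"), n, replace = TRUE),
                     levels = c("CT", "AD", "PSP", "PA")),
  sex = factor(sample(c("F", "M"), n, replace = TRUE)),
  RIN = rnorm(n, mean = 8, sd = 0.5)
)
# One gene's normalized expression with diagnosis, sex, and RIN effects.
expr <- 5 + 1.0 * (meta$diagnosis == "AD") + 0.8 * (meta$sex == "M") +
  0.5 * (meta$RIN - 8) + rnorm(n, sd = 0.3)

# Fit the full model, including the variable we want to keep (diagnosis).
fit <- lm(expr ~ diagnosis + sex + RIN, data = meta)
X <- model.matrix(fit)
beta <- coef(fit)

# Nuisance columns: everything except the intercept and diagnosis terms.
keep_cols <- c("(Intercept)", grep("^diagnosis", colnames(X), value = TRUE))
nuisance <- setdiff(colnames(X), keep_cols)

# Subtract only the fitted covariate contribution from the expression.
covariate_part <- as.numeric(X[, nuisance, drop = FALSE] %*% beta[nuisance])
residualized <- expr - covariate_part
```

Every sample stays in `residualized`, whatever its diagnosis, which matches the observation in this thread that the residualized counts file includes PSP and PA individuals. Refitting the same model on `residualized` would give covariate coefficients of essentially zero and an unchanged diagnosis effect.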
yes, i updated the post.
The files you are referencing are: syn27024969 and syn27024966 correct?

syn27024966 Mayo_Residualized_counts_(diagnosis).tsv may contain PSP and PA individuals?