Hi everyone, how are you?
I am trying to generate schizophrenia PRSs for a subset of the CMC samples, but I am running into issues and wanted to see if anyone has seen something similar before.
Briefly, I am trying to calculate schizophrenia (EUR) PRSs by applying PRSice-2 to a subset of the CommonMind Consortium cohort (790 individuals, 44% cases, 63% EUR individuals), using the [Pardiñas study](https://pubmed.ncbi.nlm.nih.gov/29483656/) as the base GWAS. The samples in the selected cohort were collected across 4 institutions and genotyped on 4 Illumina chips. I imputed the QC'd genotype data for each chip separately on the Michigan Imputation Server using a mixed-population reference panel (1000G Phase 3), merged the imputed genotype files, and QC'd the merged data using a [standard QC protocol](https://github.com/MareesAT/GWA_tutorial/) (filtering SNPs/individuals on genotype missingness, excessive heterozygosity, relatedness, MAF, HWE, and imputation INFO score, and removing multiallelic variants). I calculated the principal components of the 790 individuals in plink and used the first 6 as population covariates (PCs) in the PRS calculation. Finally, I removed ambiguous, rare, and MHC variants from the base GWAS.
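To make that last step concrete, here is a minimal Python sketch of the kind of base GWAS filter I mean. The column names (`SNP`, `CHR`, `BP`, `A1`, `A2`, `FRQ`) are placeholders rather than the actual header of the summary statistics file, and the MHC window (chr6:25-34 Mb, hg19) and the MAF cut-off of 0.01 are assumptions for illustration, not necessarily the exact values in my pipeline.

```python
# Strand-ambiguous allele pairs: each allele is the complement of the other,
# so the strand cannot be resolved from the alleles alone.
AMBIGUOUS_PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

# Assumed extended MHC window on chromosome 6 (hg19 coordinates).
MHC_CHROM, MHC_START, MHC_END = "6", 25_000_000, 34_000_000


def is_ambiguous(a1, a2):
    """Return True for A/T and C/G SNPs."""
    return (a1.upper(), a2.upper()) in AMBIGUOUS_PAIRS


def in_mhc(chrom, pos):
    """Return True if the variant falls inside the assumed MHC window."""
    return str(chrom) == MHC_CHROM and MHC_START <= int(pos) <= MHC_END


def keep_variant(row, maf_min=0.01):
    """Keep variants that are not ambiguous, not in the MHC, and not rare."""
    freq = float(row["FRQ"])
    maf = min(freq, 1.0 - freq)  # fold the allele frequency to a MAF
    return (
        not is_ambiguous(row["A1"], row["A2"])
        and not in_mhc(row["CHR"], row["BP"])
        and maf >= maf_min
    )


# Toy rows with hypothetical IDs, one per filter outcome.
example = [
    {"SNP": "rs_ok", "CHR": "1", "BP": 1000, "A1": "A", "A2": "G", "FRQ": 0.30},
    {"SNP": "rs_ambig", "CHR": "1", "BP": 2000, "A1": "A", "A2": "T", "FRQ": 0.30},
    {"SNP": "rs_mhc", "CHR": "6", "BP": 30_000_000, "A1": "A", "A2": "G", "FRQ": 0.30},
    {"SNP": "rs_rare", "CHR": "2", "BP": 3000, "A1": "C", "A2": "T", "FRQ": 0.995},
]
kept = [row["SNP"] for row in example if keep_variant(row)]
```

Only the first toy row survives all three filters; the others are dropped for being ambiguous, in the MHC window, or rare, respectively.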
However, when I run these files through PRSice-2 using the PCs as covariates, I get slightly odd results.
Here, I am trying to identify the best P-value cut-off to use (i.e. how much of the base GWAS to include) when calculating my PRSs:
[SCZ PRS in a subset of the CMC](https://www.dropbox.com/s/djy9jtw70v1qrqh/PRS_SCZ_CMC_Pardinas.PNG?dl=0)
This bar plot is telling me that the best P-value threshold from the GWAS to include in the PRS calculation is 0.41, which seems too high (it's usually around 0.05 for schizophrenia). The weirdest part, though, is that the PRS calculated at that threshold accounts for over 20% of the variance in phenotype (R2 > 0.20), which is much higher than what was reported in the Pardiñas paper (R2 = 0.12). A friend told me this could suggest some sample overlap (e.g. samples from the CMC being part of the PGC or CLOZUK studies meta-analyzed in the Pardiñas GWAS). I don't think this is the case, as I went through the sample descriptions of these papers and they don't appear to include the same cohorts, although related individuals could still be a factor.
Another thing is, if I include only EUR individuals, the R2 goes even higher.
[SCZ PRS in a subset of the CMC - EUR only](https://www.dropbox.com/s/x4vczaxabc77qep/PRS_SCZ_CMC_Pardinas_EUR_only.PNG?dl=0)
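One thing I am unsure about is whether my R2 and the published one are on the same scale, since my subset is 44% cases. Below is a rough Python sketch of the observed-to-liability-scale conversion from Lee et al. (2012), as I understand it. The population prevalence of 1% is my assumption, the function name is made up, and the formula as written applies to an R2 from a linear model on the 0/1 phenotype, so it may not map exactly onto the R2 that PRSice-2 prints.

```python
from statistics import NormalDist


def r2_liability(r2_obs, prevalence, case_fraction):
    """Convert an observed-scale R2 to the liability scale (Lee et al. 2012).

    r2_obs        : R2 from a linear model on the 0/1 phenotype
    prevalence    : assumed population prevalence K
    case_fraction : proportion of cases P in the sample
    """
    K, P = prevalence, case_fraction
    std_normal = NormalDist()
    t = std_normal.inv_cdf(1.0 - K)  # liability threshold
    z = std_normal.pdf(t)            # normal density at the threshold
    m = z / K                        # mean liability of cases
    # Scale factor between observed and liability variance under ascertainment.
    C = (K * (1.0 - K) / z**2) * (K * (1.0 - K) / (P * (1.0 - P)))
    shift = m * (P - K) / (1.0 - K)
    theta = shift * (shift - t)
    return r2_obs * C / (1.0 + r2_obs * theta * C)


# Values from my run, with an assumed prevalence of 1%.
r2_liab = r2_liability(r2_obs=0.20, prevalence=0.01, case_fraction=0.44)
```

If the published value was computed on a different scale or with a different assumed prevalence, the two numbers would not be directly comparable, so I would treat this only as a sanity check.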
Anyway, just wanted to know if anyone here is working with the same goals, if you've seen this before, and how you fixed it (if it needs fixing).