Dear team, I am trying to work with the imputed SNP data in the HBTRC study. The study mentions that there are 463 or 549 samples in the data, and the covariates file has 572 samples. However, the imputed data seems to have 740 samples. How do I reconcile these numbers? My second question: the imputed data directory has multiple "chunks" for each chromosome. Can you recommend a convenient method to combine all the minimac "prob" and "info" files for the same chromosome into a VCF or Plink file? Thanks. Best regards, Rajesh

Created by rrajesh
Thanks for a detailed and helpful reply.
Hello!

I've been looking at this data to answer your question, but a lot is still unclear from the information I have. Here is my best guess at what is going on:

1. The covariates file seems to be exclusively for the gene expression data that was also generated in this study, so it may not overlap perfectly with the SNP data.
2. The SNP data came from two different arrays, the Illumina HumanHap650Y array and the Perlegen 300K array, but data from both arrays was combined into one data set (i.e. data from both arrays is in the info file). It's _possible_ that each array used a different ID for the same person, resulting in more "samples" than there are individual people. That might be a stretch, though.
3. Based on past questions asked on this forum about the data, it looks like the authors have provided all the information on Synapse that they were going to. They probably did not release a mapping from ID -> sample for the SNP data anywhere, nor the covariates information for these samples. See also the note on [this page](syn3159435).

I think it's probably best to contact the authors of [the original paper](https://pubmed.ncbi.nlm.nih.gov/23622250/) and/or Dr. Ke Hao (cited as providing the imputed data on [this page](syn20808201)) to ask about the 740 samples vs. the stated 500 or so.

In answer to your question about combining chunks: these look like compressed text files, so you should be able to unzip them and concatenate all files of the same type from the same chromosome using a bash script or something similar. I don't know of an easy way to convert these to Plink files, but you could search for tools that convert MaCH output into what you need.

I hope that helps!
Jaclyn
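As a rough illustration of the concatenation idea above, here is a minimal Python sketch for the "info" chunks only. It rests on several assumptions that are not confirmed by the study documentation: that each chunk is a gzip-compressed text file, that every info chunk starts with the same single header line, and that you pass the chunk paths in genomic order. The function name and the file names in the usage note are hypothetical. It does not remove SNPs that may be repeated where neighbouring chunks overlap, and it is not suitable for "prob" files if those are laid out with one row per sample, because those would need to be joined column-wise rather than appended.

```python
import gzip


def concat_info_chunks(chunk_paths, out_path):
    """Append gzip-compressed info chunks row-wise, keeping one header.

    Assumes every chunk begins with the same one-line header.
    Returns the number of data rows written.
    """
    n_rows = 0
    with gzip.open(out_path, "wt") as out:
        for i, path in enumerate(chunk_paths):
            with gzip.open(path, "rt") as fh:
                header = fh.readline()
                if i == 0:
                    # Keep the header from the first chunk only.
                    out.write(header if header.endswith("\n") else header + "\n")
                for line in fh:
                    if not line.strip():
                        continue  # skip blank lines
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    out.write(line)
                    n_rows += 1
    return n_rows
```

For example, `concat_info_chunks(["chr1.chunk1.info.gz", "chr1.chunk2.info.gz"], "chr1.info.gz")` would write one combined file for chromosome 1; check the real file names and the header of one chunk before relying on this.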
I checked against known SNPs, and the data is based on hg19.
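For anyone who wants to repeat this kind of check, a minimal Python sketch of the comparison is below. It is not necessarily how the check above was done. It assumes you have already looked up the positions of a handful of well-known SNPs in each genome build from a trusted source; the function name and the dictionary layout are hypothetical, and no reference positions are supplied here.

```python
def count_build_matches(observed, reference_by_build):
    """Count, per build, how many observed SNP positions match the reference.

    observed: dict mapping SNP ID -> (chromosome, position) from the data.
    reference_by_build: dict mapping build name -> dict of the same shape,
        filled in by the user from a trusted annotation source.
    """
    matches = {}
    for build, reference in reference_by_build.items():
        matches[build] = sum(
            1 for snp, locus in observed.items() if reference.get(snp) == locus
        )
    return matches
```

The build whose count is closest to the number of SNPs checked is the likely coordinate system; a few mismatches can still occur if SNP IDs were merged or renamed between annotation releases.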
Dear team, Also, can you please clarify if the chromosome position for each allele is based on hg19 or hg38? Thanks. Best regards, Rajesh
