Hi Synapse Team,

Could you give me more clarification on the .bam files of the ROSMAP ChIP-seq experiment with Synapse ID [syn5958425](https://www.synapse.org/#!Synapse:syn5958425)? I am aware of the data processing section, which reads:

"**Data Processing:** Short single-end reads were aligned against the human reference GRCh37/hg19 using the Burrows-Wheeler Aligner (BWA 0.7.4). Picard tools were used to sort bam files and mark duplicated reads. Bam files of the 712 samples are located in the folder 'Bam files' and named by project ID (see ROS/MAP key file). Additionally, the folder contains 7 positive control samples ('PC-Pool') and 7 negative control samples ('NC-Pool'). Each control sample is a pool of either purified chromatin (positive control) or genomic DNA (negative control) from 7 individuals. The files 'PC-Pool.bam' and 'NC-Pool.bam' are merged bam files of all positive control samples or negative control samples respectively."

Were these bam files already sorted using samtools and had their duplicate reads marked using Picard, or have they not gone through these steps?

I ask because I am planning to use another peak-calling tool instead of MACS2, such as _Genrich_ or _Sicer2_. When I remove duplicates, either 1) after sorting the bam files and marking the duplicates myself or 2) by removing the duplicate reads directly from the original .bam files, I get a really small output file (~2.77E-6 GB), so I suspect I am doing something wrong.

Could you point me in the right direction on how to process these .bam files to call peaks an alternative way? That would be super helpful.

Thank you so much, Phoebe
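One way to investigate a near-empty output like the one described above is to count how many records carry the duplicate flag before removing anything. The sketch below is only an illustration: `count_dups` is a hypothetical helper name, it reads SAM text from standard input, and it assumes duplicates are recorded in bit 0x400 (decimal 1024) of the FLAG column, as the SAM format defines.

```shell
# Count total and duplicate-flagged records in SAM text read from stdin.
# Bit 0x400 (decimal 1024) of the FLAG column (field 2) marks a duplicate.
count_dups() {
  awk -F'\t' '
    /^@/ { next }                                   # skip header lines
    { total++; if (int($2 / 1024) % 2 == 1) dups++ }
    END { printf "total=%d duplicates=%d\n", total, dups }
  '
}

# Hypothetical usage, assuming samtools is on PATH and R#######.bam stands in
# for a real file name:
#   samtools view -h R#######.bam | count_dups
```

If samtools is available, `samtools view -c -f 1024 R#######.bam` should report the same duplicate count directly. A total close to zero would suggest the problem lies upstream of duplicate removal.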

Created by Phoebe Valdes prvaldes
Hi @abby.vanderlinden,

Thank you for your reply. After looking more closely at the headers of some of the .bam files, I was able to figure out that they were already sorted and that the duplicates were marked using Picard. I used the following command to check:

```
samtools head R#######.bam
```

The header looks like this for sample R#######.bam:

```
@HD VN:1.4 SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
@SQ SN:4 LN:191154276
@SQ SN:5 LN:180915260
@SQ SN:6 LN:171115067
@SQ SN:7 LN:159138663
@SQ SN:8 LN:146364022
@SQ SN:9 LN:141213431
@SQ SN:10 LN:135534747
@SQ SN:11 LN:135006516
@SQ SN:12 LN:133851895
@SQ SN:13 LN:115169878
@SQ SN:14 LN:107349540
@SQ SN:15 LN:102531392
@SQ SN:16 LN:90354753
@SQ SN:17 LN:81195210
@SQ SN:18 LN:78077248
@SQ SN:19 LN:59128983
@SQ SN:20 LN:63025520
@SQ SN:21 LN:48129895
@SQ SN:22 LN:51304566
@SQ SN:X LN:155270560
@SQ SN:Y LN:59373566
@SQ SN:MT LN:16569
@SQ SN:GL000207.1 LN:4262
@SQ SN:GL000226.1 LN:15008
@SQ SN:GL000229.1 LN:19913
@SQ SN:GL000231.1 LN:27386
@SQ SN:GL000210.1 LN:27682
@SQ SN:GL000239.1 LN:33824
@SQ SN:GL000235.1 LN:34474
@SQ SN:GL000201.1 LN:36148
@SQ SN:GL000247.1 LN:36422
@SQ SN:GL000245.1 LN:36651
@SQ SN:GL000197.1 LN:37175
@SQ SN:GL000203.1 LN:37498
@SQ SN:GL000246.1 LN:38154
@SQ SN:GL000249.1 LN:38502
@SQ SN:GL000196.1 LN:38914
@SQ SN:GL000248.1 LN:39786
@SQ SN:GL000244.1 LN:39929
@SQ SN:GL000238.1 LN:39939
@SQ SN:GL000202.1 LN:40103
@SQ SN:GL000234.1 LN:40531
@SQ SN:GL000232.1 LN:40652
@SQ SN:GL000206.1 LN:41001
@SQ SN:GL000240.1 LN:41933
@SQ SN:GL000236.1 LN:41934
@SQ SN:GL000241.1 LN:42152
@SQ SN:GL000243.1 LN:43341
@SQ SN:GL000242.1 LN:43523
@SQ SN:GL000230.1 LN:43691
@SQ SN:GL000237.1 LN:45867
@SQ SN:GL000233.1 LN:45941
@SQ SN:GL000204.1 LN:81310
@SQ SN:GL000198.1 LN:90085
@SQ SN:GL000208.1 LN:92689
@SQ SN:GL000191.1 LN:106433
@SQ SN:GL000227.1 LN:128374
@SQ SN:GL000228.1 LN:129120
@SQ SN:GL000214.1 LN:137718
@SQ SN:GL000221.1 LN:155397
@SQ SN:GL000209.1 LN:159169
@SQ SN:GL000218.1 LN:161147
@SQ SN:GL000220.1 LN:161802
@SQ SN:GL000213.1 LN:164239
@SQ SN:GL000211.1 LN:166566
@SQ SN:GL000199.1 LN:169874
@SQ SN:GL000217.1 LN:172149
@SQ SN:GL000216.1 LN:172294
@SQ SN:GL000215.1 LN:172545
@SQ SN:GL000205.1 LN:174588
@SQ SN:GL000219.1 LN:179198
@SQ SN:GL000224.1 LN:179693
@SQ SN:GL000223.1 LN:180455
@SQ SN:GL000195.1 LN:182896
@SQ SN:GL000212.1 LN:186858
@SQ SN:GL000222.1 LN:186861
@SQ SN:GL000200.1 LN:187035
@SQ SN:GL000193.1 LN:189789
@SQ SN:GL000194.1 LN:191469
@SQ SN:GL000225.1 LN:211173
@SQ SN:GL000192.1 LN:547496
@SQ SN:NC_007605 LN:171823
@RG ID:50302428 PL:ILLUMINA LB:50302428_541 SM:50302428
@PG ID:MarkDuplicates PN:MarkDuplicates VN:1.738(86a30760afd8c3002421e207b7557896544a3805_1406042774) CL:picard.sam.MarkDuplicates INPUT=[bam/50302428_sorted.bam] OUTPUT=bam/50302428.bam METRICS_FILE=/dev/null REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=LENIENT PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
```

Toward the bottom, the `@PG` record shows that the file was already sorted and had its duplicates marked:

```
CL:picard.sam.MarkDuplicates INPUT=[bam/50302428_sorted.bam] OUTPUT=bam/50302428.bam
```

In that case I think I've answered my own question. I also figured out how to remove the PCR duplicates afterward and still retain a fairly large file by using the following command:

```
samtools rmdup -s R#######.bam R#######.nd.bam
```

If Drs. Xu or Klein can add any more information I need to know about these .bam files before I proceed with other peak-calling tools, that would be great.
Otherwise, thank you for your reply and have a good rest of your week. -Phoebe
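The header check described above can be scripted rather than read by eye. The following is a minimal sketch, not part of the original processing pipeline: `summarize_header` is a hypothetical helper that reads SAM header text from standard input and reports the declared sort order, whether a MarkDuplicates `@PG` record is present, and what `REMOVE_DUPLICATES` was set to in that record.

```shell
# Summarize a SAM/BAM header read from stdin: the declared sort order and
# whether a MarkDuplicates @PG record is present.
summarize_header() {
  awk '
    # @HD line: pick up the SO: (sort order) field if present
    /^@HD/ { for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) so = substr($i, 4) }
    # @PG line written by MarkDuplicates: note the REMOVE_DUPLICATES setting
    /^@PG/ && /MarkDuplicates/ {
      md = "yes"
      if ($0 ~ /REMOVE_DUPLICATES=false/) removed = "no"
      else if ($0 ~ /REMOVE_DUPLICATES=true/) removed = "yes"
    }
    END {
      printf "sort_order=%s\n", (so == "" ? "unknown" : so)
      printf "markduplicates=%s\n", (md == "" ? "no" : md)
      printf "duplicates_removed=%s\n", (removed == "" ? "unknown" : removed)
    }
  '
}

# Hypothetical usage, assuming samtools is on PATH and R#######.bam stands in
# for a real file name:
#   samtools view -H R#######.bam | summarize_header
```

For the header shown above, `REMOVE_DUPLICATES=false` indicates that duplicates were only flagged and are still present in the file, which is consistent with a separate removal step still being needed.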
Hi there, I unfortunately don't have more details on the ROSMAP ChIPseq data, but hopefully someone from the Rush or Columbia teams can help. I see @xujishu uploaded the bam files originally, and @haklein's manuscript is cited in the reference. Drs. Xu or Klein, do you have any additional information you can share to help Phoebe with this ChIPseq data? Thanks, Abby

Clarification needed for syn5958425 BAM files