I was wondering what methods other folks are using to retrieve labels from the .tsv ChIP files.
Loading into memory and iterating is quite slow, and the natural thing to do would be to index the file in some way.
I was thinking that converting to the bigBed format would be the easiest way to do this?
Created by Lawrence Du (LawrenceDu)

Thanks! tabix is the way to go. Here is a script to reformat all the gzipped ChIP tsv files into tabix-compatible bgzipped tsv files:
```
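#!/bin/bash
# Requires tabix and bgzip installed!
# Assumes each label file has one header line followed by chr, start, stop columns.
tsv_file_dir="$HOME/labels_dir"
for filename in "$tsv_file_dir"/*.tsv.gz; do
    echo "Decompressing $filename with overwrite."
    gzip -df "$filename"
done
for filename in "$tsv_file_dir"/*.tsv; do
    echo "Sorting and converting $filename to .bgz format"
    # Keep the header line first, sort the remaining rows by chromosome
    # and start position (tabix needs sorted input), then bgzip the result.
    (head -n 1 "$filename"; tail -n +2 "$filename" | sort -k1,1 -k2,2n) \
        | bgzip -c > "$filename.bgz"
    # Column 1 = chromosome, 2 = start, 3 = end; -S 1 skips the header line.
    # -0 assumes BED-style 0-based, half-open coordinates; drop it if yours are 1-based.
    tabix -0 -s 1 -b 2 -e 3 -S 1 "$filename.bgz"
done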
```
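Once the files are bgzipped and indexed, pulling out the labels for a region could look roughly like the sketch below. The function names `region_string` and `fetch_labels` and the file name `labels.tsv.bgz` are made up for illustration; the only real tool involved is `tabix`, which takes a region in `chr:start-end` form and expects the `.tbi` index to sit next to the data file.

```bash
#!/bin/bash
# Hypothetical helpers for querying a tabix-indexed label file.

# Build a tabix region string such as chr10:100-200.
region_string() {
    local chrom=$1 start=$2 end=$3
    echo "${chrom}:${start}-${end}"
}

# Print every row of the indexed file that overlaps the given region.
# Usage: fetch_labels <file.tsv.bgz> <chrom> <start> <end>
fetch_labels() {
    local bgz_file=$1
    tabix "$bgz_file" "$(region_string "$2" "$3" "$4")"
}

# Example call (file name is an assumption):
#   fetch_labels "$HOME/labels_dir/labels.tsv.bgz" chr10 100000 101000
```

Because tabix only reads the blocks that overlap the region, this avoids loading the whole file into memory.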
You could use bigBed, but tabix (http://www.htslib.org/doc/tabix.html) would be even better, since it can be used for any generic tab-delimited genomic file.
-Anshul.