I downloaded the training files and converted them to bedGraph format as suggested. However, it looks to me like this is bad advice. The bedGraph files do not represent vectors of values and have different resolutions. I've read about the wig format and that looks to me much more appropriate. Essentially there is one value per interval with a constant interval size. I'm going to have a go at converting to wig instead. Am I missing something?
Created by Dave Curtis (DaveCurtis)

Yes, that's correct. The bigWigs are at 1 bp resolution. Reading them with pyBigWig will be most efficient. Alternatively, use bedGraph (it is more compact than wig) and convert it into a vector by writing each interval's value into the corresponding positions of a chromosome-sized vector.
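To illustrate the bedGraph-to-vector step described above, here is a minimal sketch in plain Python. The function name `bedgraph_to_vector` and its parameters are hypothetical, it assumes whitespace-separated bedGraph lines, and it assumes positions not covered by any interval should default to 0.0. The sample lines are taken from the C02M01 listing further down this thread, truncated to a short toy chromosome length.

```python
def bedgraph_to_vector(lines, chrom, length, fill=0.0):
    """Expand bedGraph intervals for one chromosome into a 1 bp vector.

    Assumption: positions not covered by any interval keep `fill`.
    """
    vec = [fill] * length
    for line in lines:
        line = line.strip()
        # Skip blank lines and header/comment lines such as "#bedGraph section ..."
        if not line or line.startswith(("#", "track", "browser")):
            continue
        c, start, end, value = line.split()[:4]
        if c != chrom:
            continue
        start, end = int(start), min(int(end), length)
        if start < end:
            # bedGraph intervals are 0-based and half-open: [start, end)
            vec[start:end] = [float(value)] * (end - start)
    return vec


example = [
    "#bedGraph section chr1:0-629152",
    "chr1 0 10434 0.00168",
    "chr1 10434 10437 0.01114",
    "chr1 10437 10454 0.03857",
]
vec = bedgraph_to_vector(example, "chr1", length=10460)
print(vec[10433], vec[10434], vec[10437])
```

A plain list is fine for a toy example like this; for a real chromosome a compact numeric array would use far less memory.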
You will report the imputed tracks at 25 bp resolution using the template file provided.

OK, I guess I understand. You're providing the bedgraph data and it's up to us to convert it into a vector of whatever resolution we want. Then when we come to submit we have to produce a vector at 25 bp resolution which fits into the submission template:
[rejudcu@comic2 links]$ head submission_template.bedgraph
chr1 0 25 0.0
chr1 25 50 0.0
chr1 50 75 0.0
chr1 75 100 0.0
chr1 100 125 0.0
chr1 125 150 0.0
chr1 150 175 0.0
chr1 175 200 0.0
chr1 200 225 0.0
chr1 225 250 0.0

Actually I'm wrong, the wig format is no better.
This is what I get:
[rejudcu@comic2 wig]$ head C01M16.wig
#bedGraph section chr1:0-905223
chr1 0 17350 0.22658
chr1 17350 17632 0.64287
chr1 17632 56900 0.22658
chr1 56900 57054 0.20885
chr1 57054 86935 0.22658
chr1 86935 87100 0.64287
chr1 87100 100502 0.22658
chr1 100502 100667 0.64287
chr1 100667 115720 0.22658
This is not how the data are described in the challenge overview, which suggests we should get a vector at 25 bp resolution. The intervals all have different lengths, and those lengths are not multiples of 25. They also differ between files:
[rejudcu@comic2 wig]$ head C02M01.wig
#bedGraph section chr1:0-629152
chr1 0 10434 0.00168
chr1 10434 10437 0.01114
chr1 10437 10454 0.03857
chr1 10454 10458 0.09444
chr1 10458 10584 0.18633
chr1 10584 10587 0.09444
chr1 10587 10604 0.03857
chr1 10604 10608 0.01114
chr1 10608 15997 0.00168
It seems to me that to get this into reasonable shape I have to read in these files and output a value for every 25 bp interval. I can do that (in fact I already started writing the code). But it isn't what I was expecting.
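For anyone attempting the same conversion, here is a rough sketch of one way to turn variable-length intervals into one value per 25 bp bin without first expanding to 1 bp. It takes the mean signal over each bin, weighted by how many bases each interval contributes. That choice of aggregation is an assumption on my part, not something stated by the organizers, and the function names `bin_intervals` and `to_bedgraph_lines` are hypothetical. Bases not covered by any interval count as 0, and the tab-separated output is likewise an assumption about the template.

```python
def bin_intervals(intervals, n_bins, bin_size=25):
    """Average (start, end, value) intervals into fixed-size bins.

    Assumption: the bin value is the per-base mean, with uncovered
    bases contributing 0. Intervals are 0-based, half-open.
    """
    sums = [0.0] * n_bins
    for start, end, value in intervals:
        first = start // bin_size
        last = min((end - 1) // bin_size, n_bins - 1)
        for b in range(first, last + 1):
            # Overlap between this interval and bin b
            lo = max(start, b * bin_size)
            hi = min(end, (b + 1) * bin_size)
            if hi > lo:
                sums[b] += value * (hi - lo)
    return [s / bin_size for s in sums]


def to_bedgraph_lines(chrom, values, bin_size=25):
    """Yield lines shaped like the submission template (tab-separated, assumed)."""
    for i, v in enumerate(values):
        yield f"{chrom}\t{i * bin_size}\t{(i + 1) * bin_size}\t{v:.5f}"


# Toy example: 10 bp at 1.0 followed by 40 bp at 2.0, covering two bins
binned = bin_intervals([(0, 10, 1.0), (10, 50, 2.0)], n_bins=2)
for line in to_bedgraph_lines("chr1", binned):
    print(line)
```

If a different summary is wanted (for example the maximum within each bin), only the accumulation inside the inner loop needs to change.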
Also, when you score the predictions, either you will have to do this interpolation as well, or we will have to provide the correct values for the intervals specified in the validation and test datasets.
I hope you understand my concerns.