Hi,
I wanted to try some regression algorithms and came across this.
I ran the "bigWigAverageOverBed" command on some of the "B" and "U" regions, using "ChIPseq.GM12878.SPI1.fc.signal.train.bw" as the signal track.
I compared the mean signal value (mean0) reported by "bigWigAverageOverBed" to the "B"/"U" labels. Here's what I found:
mean0|label
15.6299|U
15.3636|U
15.287|U
0.62855|U
0.62855|U
0.62855|U
0.2346|B
0.4211|B
0.43715|B
131.822|B
147.208|B
150.127|B
The table above is just an example. Overall, the average value for "B" is 11 and for "U" is 0.5, which looks reasonable. But the first row, for example, has a very high signal (15.6299). Why is it labelled "U"?
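A minimal sketch of this comparison, using only the twelve example rows from the table (so the per-label means will not match the 11 and 0.5 quoted for the larger set). The `HIGH_SIGNAL` cutoff is a hypothetical value chosen for illustration, not something defined by the challenge.

```python
from collections import defaultdict
from statistics import mean, median

# (mean0, label) pairs copied from the example table above.
rows = [
    (15.6299, "U"), (15.3636, "U"), (15.287, "U"),
    (0.62855, "U"), (0.62855, "U"), (0.62855, "U"),
    (0.2346, "B"), (0.4211, "B"), (0.43715, "B"),
    (131.822, "B"), (147.208, "B"), (150.127, "B"),
]

HIGH_SIGNAL = 5.0  # hypothetical cutoff, for illustration only

# Group the mean0 values by label.
by_label = defaultdict(list)
for value, label in rows:
    by_label[label].append(value)

# Per-label (mean, median) of mean0.
summary = {label: (mean(vals), median(vals)) for label, vals in by_label.items()}

# Rows whose signal disagrees with what the label would suggest.
suspicious_u = [v for v, label in rows if label == "U" and v > HIGH_SIGNAL]
weak_b = [v for v, label in rows if label == "B" and v < HIGH_SIGNAL]

for label, (avg, med) in sorted(summary.items()):
    print(f"{label}: mean={avg:.3f} median={med:.3f} n={len(by_label[label])}")
print("U rows above cutoff:", suspicious_u)
print("B rows below cutoff:", weak_b)
```

Listing both kinds of mismatch (high-signal "U" and low-signal "B") makes it easier to look up the corresponding genome coordinates later.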
I'll try to give the genome coordinates later.
Thank you,
Yichao
Created by yichao li unfashionable
The ChIP-seq peaks are called using the SPP peak caller, which scores peaks based not just on fold-enrichment but also on peak shape, i.e. the relative balance and shift between the positive- and negative-strand read counts around a purported peak summit. Each called peak region extends beyond the summit on either side to reflect the uncertainty in the summit position, which is a function of the fragment length. Also note that peaks are scored based on the shape-corrected fold-enrichment of signal within peak regions, not on fixed-size bins along the genome. Finally, peaks are thresholded based on their reproducibility and signal strength across replicates, not on signal strength alone. So bin-level average fold-enrichments are not going to exactly match the scores of reproducible peaks. For example, a peak with a higher overall fold-enrichment using reads pooled across replicates may have poorer reproducibility across replicates than another peak with lower pooled enrichment but better reproducibility. There are going to be some inconsistencies between the peaks called and the fold-enrichment.
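A toy illustration of the pooled-enrichment versus reproducibility point. This is not SPP or the challenge's actual thresholding procedure; the replicate values, the `MIN_PER_REPLICATE` cutoff, and the `is_reproducible` rule are all hypothetical and only meant to show how the two criteria can disagree.

```python
# Hypothetical per-replicate fold-enrichment for two candidate peaks.
peaks = {
    "peak_a": (30.0, 2.0),   # strong in one replicate, weak in the other
    "peak_b": (12.0, 11.0),  # moderate but consistent
}

MIN_PER_REPLICATE = 5.0  # hypothetical cutoff, for illustration only


def pooled_enrichment(replicates):
    """Average enrichment across replicates (a stand-in for pooled signal)."""
    return sum(replicates) / len(replicates)


def is_reproducible(replicates, cutoff=MIN_PER_REPLICATE):
    """Toy rule: every replicate must individually clear the cutoff."""
    return min(replicates) >= cutoff


results = {
    name: (pooled_enrichment(reps), is_reproducible(reps))
    for name, reps in peaks.items()
}

for name, (pooled, reproducible) in results.items():
    print(f"{name}: pooled={pooled:.1f} reproducible={reproducible}")
```

Here the hypothetical `peak_a` has the higher pooled value yet fails the toy reproducibility rule, while `peak_b` passes with a lower pooled value.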
Further, the fold-enrichment tracks are generated using MACS2 and don't include a peak-shape penalty, since they are not designed around the notion of a peak but simply provide an enrichment score per nucleotide. The bins that we label are at 200 bp resolution, and they are labeled B, U, or A not by the average fold-enrichment within the bin but by their overlap with reproducible peaks. Since the bins are of fixed size, they do not always capture the peak regions exactly. Bins flanking a peak may only partially overlap it and otherwise contain background signal, so such flanking bins can have lower average fold-enrichments.
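The effect described above can be sketched with synthetic data. Everything here is an assumption made for illustration: the per-base signal values, the peak interval, the non-overlapping bin layout, and the `MIN_OVERLAP_FRACTION` rule are hypothetical and not the challenge's actual labeling criteria. Only "B" and "U" are modeled.

```python
BIN_SIZE = 200              # bin resolution mentioned in the post
MIN_OVERLAP_FRACTION = 0.5  # hypothetical labeling rule, for illustration


def overlap_bp(bin_start, bin_end, peaks):
    """Total bases of [bin_start, bin_end) covered by the peak intervals."""
    return sum(max(0, min(bin_end, e) - max(bin_start, s)) for s, e in peaks)


def label_bins(signal, peaks, bin_size=BIN_SIZE, min_frac=MIN_OVERLAP_FRACTION):
    """Label each fixed-size bin by peak overlap and report its mean signal."""
    rows = []
    for start in range(0, len(signal) - bin_size + 1, bin_size):
        end = start + bin_size
        mean_signal = sum(signal[start:end]) / bin_size
        frac = overlap_bp(start, end, peaks) / bin_size
        label = "B" if frac >= min_frac else "U"
        rows.append((start, end, mean_signal, label))
    return rows


# Synthetic per-base fold-enrichment: flat background of 0.5 ...
signal = [0.5] * 1600
# ... an enriched stretch that is NOT in the reproducible peak set ...
for i in range(200, 400):
    signal[i] = 15.0
# ... and a sharp summit inside a called peak that extends past it.
for i in range(1100, 1200):
    signal[i] = 100.0
peaks = [(1000, 1300)]  # hypothetical reproducible peak interval

rows = label_bins(signal, peaks)
for start, end, mean_signal, label in rows:
    print(f"{start}-{end}\tmean={mean_signal:.2f}\t{label}")
```

In this toy setup a flanking bin can be labeled "B" while holding only background signal, and an enriched bin outside the peak set stays "U", which mirrors the kind of mismatch in the table from the original question.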
The prediction task we set up is not a regression task but a classification task. The fold-enrichment tracks are merely provided as a way to augment learning in whatever way you see fit. There is no reason they have to be 100% consistent with the labels, given the resolution differences between the bin labels and the fold-enrichment values. They are certainly largely consistent. We have experimented with various strategies to threshold peaks and label the genome, and the strategy we have employed ends up being stable and does not require tons of manual fiddling compared to other options. Were we to model this as a regression task, we would have used a different strategy to score bins.
-Anshul.
It is an interesting point. We had noticed something similar (TFBS predictions were sometimes "leading" the B windows a little, i.e. a window adjacent to B/A but labelled U would score high while the B window scored lower) but failed to analyse it further or take advantage of it! Your example is a bit more worrisome, though. If it is a fairly consistent pattern and not random variation, I'd be curious to know the reason.