I would like some clarification on the rules about using known motifs from motif databases. One section of the rules says that this is permitted.
However, allowing this seems a bit silly to me, since one of the goals of writing such a learning algorithm would presumably be to learn motifs.
For example: Could I use CTCF motif data derived from ChIP-exo data and use that as a parameter for predicting binding for the CTCF ChIP peaks in the datasets provided?
Could I, for example, use known CTCF motifs to try to get more predictive power for the other non-CTCF transcription factors?
It seems to me the motifs of certain transcription factors, particularly pioneer factors or insulators, would possibly be useful for predicting the binding of other transcription factors, but in a way that is probably going to be different for different transcription factors.
Created by Lawrence Du LawrenceDu

The goal of the challenge is not to learn motifs. The goal of the challenge is to predict binding. You can most certainly predict binding without explicitly learning motifs. Take, for example, a support vector machine with a string kernel: it never learns a motif explicitly but works directly off the raw sequence k-mer representations. We allow open use of any motifs learned in any way because learning motifs is not the objective function. Regarding your specific questions:
>> Could I use CTCF motif data derived from ChIP-exo data and use that as a parameter for predicting binding for the CTCF ChIP peaks in the datasets provided?
Yes, you can, as long as you are using the motif itself, and not the ChIP-exo data, as features in the learning model.
>> Could I, for example, use known CTCF motifs to try to get more predictive power for the other non-CTCF transcription factors?
Absolutely. You will almost surely get better performance in predicting a target factor by including sequence features of relevant co-factors. How you decide on relevant factors and their motifs is what your learning method needs to figure out.
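To make the co-factor idea concrete, here is a minimal sketch of turning several motifs into per-sequence features. Everything in it is an assumption for illustration: the two motifs are toy consensus-based scoring matrices (not real CTCF or co-factor motifs from any database), the helper names are hypothetical, and only the forward strand is scanned. A real pipeline would load published position weight matrices and also scan the reverse complement.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) array; non-ACGT bases become all-zero rows."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def consensus_pwm(consensus, match=1.0, mismatch=-1.0):
    """Toy scoring matrix (hypothetical): +match for the consensus base, mismatch otherwise."""
    pwm = np.full((len(consensus), 4), mismatch)
    for pos, base in enumerate(consensus):
        pwm[pos, BASES.index(base)] = match
    return pwm

def max_pwm_score(seq, pwm):
    """Best score of a (w, 4) matrix over all forward-strand windows of seq."""
    width = pwm.shape[0]
    if len(seq) < width:
        return float("-inf")
    encoded = one_hot(seq)
    scores = [np.sum(encoded[i:i + width] * pwm) for i in range(len(seq) - width + 1)]
    return float(max(scores))

def motif_features(seqs, pwms):
    """One row per sequence, one column per motif (columns sorted by motif name)."""
    names = sorted(pwms)
    matrix = np.array([[max_pwm_score(s, pwms[n]) for n in names] for s in seqs])
    return names, matrix

# Toy motifs standing in for a target factor and a co-factor (assumed, not real).
toy_pwms = {"ctcf_like": consensus_pwm("CCCT"), "gata_like": consensus_pwm("GATA")}
names, X = motif_features(["AACCCTAA", "TTGATATT"], toy_pwms)
```

The resulting matrix `X` can be fed to any classifier, which is then free to weight the co-factor columns differently for each target factor.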
>> It seems to me the motifs of certain transcription factors, particularly pioneer factors or insulators, would possibly be useful for predicting the binding of other transcription factors, but in a way that is probably going to be different for different transcription factors.
Yes, absolutely. And this is the problem you need to solve. In vivo, the primary motif of the target factor by itself is not at all sufficient to predict a cell-type-specific binding event. You need to be able to integrate a variety of features, including the ones you mention, to have reasonable performance.
You are allowed to do all of the above. There are no restrictions on which motifs you use, how many motifs you use, or where you got the motifs from (as long as you are able to provide them and cite or provide the method that derives them). And there is no requirement to use motifs at all. You are free to use or learn any sequence representation you like (motifs being one of them).
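As one example of a non-motif sequence representation, here is a minimal sketch of a k-mer spectrum kernel, the simplest kind of string kernel. The function names and the choice of k are assumptions for illustration, not part of the challenge rules or of any particular library.

```python
from collections import Counter

import numpy as np

def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return float(sum(n * counts_b[kmer] for kmer, n in counts_a.items()))

def gram_matrix(seqs, k=3):
    """Symmetric kernel matrix over a list of sequences."""
    n = len(seqs)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = spectrum_kernel(seqs[i], seqs[j], k)
    return K

K = gram_matrix(["ACGTACGT", "ACGTTTTT", "GGGGCCCC"], k=3)
```

A matrix like `K` can be passed to an SVM that accepts precomputed kernels (for instance scikit-learn's `SVC(kernel="precomputed")`), so the classifier works on k-mer content without ever building an explicit motif.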
-Anshul.