Is it permissible to use sequence conservation information when deriving motifs or in any way as part of the prediction method?
Created by John Reid Epimetheus

For the purposes of this challenge, we restricted input modalities to properties of the genome and the TF, i.e. properties that actually contribute in a physical way to TF binding. Conservation scores are virtual and not a physical property of the genome or the TF. Conservation is most certainly a predictive feature, but for this challenge we are leaving it out. As noted above, you can use it however you like to derive sequence features, but not directly as an input modality.

What's the rationale for not being allowed to use conservation scores like GERP and PhastCons directly?

An example of direct use would be using the numeric sequence conservation scores (e.g. GERP or PhastCons scores) as direct features in a predictive model. An indirect use would be deriving PWMs or some other motif representation that in some way uses conservation scores, where the features going into the model are the PWM/motif match scores (or some transformation of these scores) against the sequence. The latter, indirect use would be fine, since the primary representation used in the model is a sequence-pattern-based feature. The former, direct use would not be allowed. Let me know if that clarifies things.
Thanks,
Anshul.

I am a little confused about the sequence conservation part. Let's say we use multiple species alignments to learn motifs. Since these motifs will be used in the predictive model, this would mean that sequence conservation is also getting used. How do you define "direct use" here? Can you give a specific example of what is allowed and what is not?

You can use sequence conservation to derive sequence motifs. But you cannot use conservation directly in your predictive model since this is not a supported data type for the purposes of this challenge.
Thanks,
Anshul.
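
To make the direct/indirect distinction in this thread concrete, here is a minimal Python sketch. It is one reading of the rule, not an official ruling from the organizers. Everything in it is hypothetical: the binding sites, the per-site conservation values (invented, PhastCons-like numbers in [0, 1]), the region sequence, and the helper names `build_pwm` and `best_match_score`. Conservation is used only to weight sites while building the PWM (indirect use); the only feature handed to the model is the motif match score.

```python
import math

BASES = "ACGT"


def build_pwm(sites, weights=None, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM from aligned, equal-length sites.

    `weights` lets each site count more or less; here it is where
    conservation is allowed to enter (indirect use).
    """
    if weights is None:
        weights = [1.0] * len(sites)
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site, w in zip(sites, weights):
            counts[site[pos]] += w
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def best_match_score(pwm, sequence):
    """Return the highest log-odds score of the PWM over all windows."""
    width = len(pwm)
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )


# Hypothetical training sites and invented per-site conservation values.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
site_conservation = [0.9, 0.8, 0.95, 0.2]

# Indirect use: conservation only shapes the motif.
pwm = build_pwm(sites, weights=site_conservation)

# The model input is a sequence-pattern feature: the motif match score.
region = "GGCATGACTCATTCG"
features = [best_match_score(pwm, region)]

# Direct use (not allowed under the rule described in this thread) would be
# appending the region's conservation scores themselves to `features`.
```

The design point is where conservation enters the pipeline: it may influence how the motif is learned, but the feature vector contains only scores computed from the sequence.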