Dear Organizers, I see in the data description that you recommend using the DNAShapeR tool for estimating DNA shape parameters. Would this and the other data restrictions preclude the use of the information in the [Dinucleotide Property Database](http://diprodb.fli-leibniz.de)? Thanks.

Created by Peter DeFord pdeford
Hi Jan,

>> "An exception to this rule is the use of sequence features derived from DNA sequence. E.g. participants can use experimentally or computationally derived sequence motifs or other types of features derived from DNA sequence from any source." However, from my perspective, shape parameters from other sources are exactly that "other types of features derived from DNA sequence from any source." So could you please clarify, why these are not allowed?

You are right that the DNA shape features are derived from DNA sequence. What we mean by features derived from DNA sequence are sequence-based features that reflect the sequence affinity of factors, e.g. raw sequence, k-mer counts, known motifs, motifs discovered from the training data, etc. We wanted to keep this open because there are many ways in which one can featurize DNA sequence to represent sequence affinity. While in-vitro DNA shape parameters are derived from DNA sequence, they do not reflect sequence affinity but rather represent DNA shape. We just had to make a choice here and limit DNA shape features to the popular ones that are in use. Was there a specific DNA shape measure you wanted to use?

>> In addition, I have two further, related questions: First, I am unsure if we are allowed to use gene coordinates and other annotations from GENCODE. These are linked to in section 4 of the Challenge Data Description, but it is stated nowhere explicitly that these may be used for learning models and making predictions.

Yes, you can certainly use coordinates, proximity to genes and any other information provided in the GENCODE annotations. These are provided and listed in the RNA data section. We consider gene annotations to be part of the gene expression "data type". So anything listed in the data section can be used.

>> Finally, may we also use motifs from other sources than those listed in the Challenge Resources? Again, I would consider this to be allowed given above excerpt from the Challenge Rules. However, I want to be sure as it appears that I already misinterpreted the rules with regard to shape parameters (or similar k-mer parameters from external sources).

Yes, you can use any sequence motif database (as long as you can share it, or it's already published and you can point to a reference). You can even learn motifs from the training data if you like. As I said above, we consider DNA shape features not to be a way to featurize the sequence itself but rather to capture a different property of DNA, i.e. shape, which in this case does happen to be derived from sequence.

Hope that clarifies things. It's hard to define exactly and precisely what can and cannot be used. Any resource provided or listed in the Accessing data section can be used. Nothing outside that can be used, except for features derived from sequence that represent affinity information, e.g. motifs or k-mer counts. These are clearly arbitrary choices, but we need to enforce some level of consistency in the data types being used by participants so that models are reasonably comparable independent of inclusion or exclusion of specific data types.

Thanks, Anshul.
Dear Anshul, dear organizers, it would be great if you could specify somewhere in even greater detail, which kind of (external) data are allowed and which are not. Currently, I am slightly puzzled in this regard. The Challenge Rules & Conditions state that "An exception to this rule is the use of sequence features derived from DNA sequence. E.g. participants can use experimentally or computationally derived sequence motifs or other types of features derived from DNA sequence from any source." However, from my perspective, shape parameters from other sources are exactly that "other types of features derived from DNA sequence from any source." So could you please clarify, why these are not allowed? In addition, I have two further, related questions: First, I am unsure if we are allowed to use gene coordinates and other annotations from GENCODE. These are linked to in section 4 of the Challenge Data Description, but it is stated nowhere explicitly that these may be used for learning models and making predictions. Finally, may we also use motifs from other sources than those listed in the Challenge Resources? Again, I would consider this to be allowed given above excerpt from the Challenge Rules. However, I want to be sure as it appears that I already misinterpreted the rules with regard to shape parameters (or similar k-mer parameters from external sources). Thanks a lot for your help and also for setting up this exciting challenge! Jan
Sounds good. Thanks for getting back to me! Peter
For the purposes of the challenge, please restrict yourself to the DNA shape properties provided by DNAShapeR (or precomputed tracks generated with the same tool). This is so that we can compare prediction methods systematically without being confounded by the use of different types of input data. During the community phase, we will open it up to any other kind of data that may be useful. Thanks, Anshul.
