According to the rules provided: "Models must be trained from scratch (no pre-training), in order to avoid overfitting to sequences present in the test data (e.g. some sequences in the test data are derived from extant yeast promoters)."
Is it allowed to use a semi-supervised approach? All participants have access to the test sequences, and it seems reasonable to give the model some information about the "sequence space".
Created by Dmitry Penzar (penzard)

Hello @mtinti, sorry, somehow I missed this. No, that would not be allowed.

Hi @muntakimrafi,
I'd like to try out external programs that compute features on the training data.
For example, using the melting temperature over a window of base pairs. Would this be allowed?

Thanks indeed, much clearer now!

Thanks a lot for your answer!
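To make the melting-temperature question above concrete, here is a minimal sketch of the kind of windowed feature computation being asked about, assuming Biopython is available; the window size and the nearest-neighbor method are illustrative choices, not anything specified in the thread:

```python
from Bio.SeqUtils import MeltingTemp as mt

def windowed_tm(seq, window=20, step=1):
    """Nearest-neighbor melting temperature for each sliding window."""
    return [mt.Tm_NN(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

# Hypothetical fragment; real inputs would be the provided training sequences.
features = windowed_tm("ATGCAATTGGCCGGCCAATTGCAT")
```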
This is really helpful and clearly defines the rules for accepting a model.
Our intent is to come up with the best model architectures and the best ways to train models on sequence-to-expression tasks, so that the results benefit everyone in the community. Machine learning competitions are almost always won by ensembles (sometimes of 30, 40, or 50 models), and we want to avoid that happening here.
1. We are allowing random forest and gradient boosting methods to keep the option of using decision trees open for participants. If someone wants to use k-mers and XGBoost to beat deep neural networks, we want to keep that path open [in my opinion, RF/XGBoost should never win over neural networks on this dataset]. Also, these methods are by nature ensembles. (A sketch of this path appears after the list.)
2. We are aware that participants could incorporate layers or paths in their neural network architecture that simulate an ensemble to some extent. But if you come up with such architectural choices and are able to train them end-to-end, that can be thought of as a novel architecture/strategy. Building something like InceptionNet, or adding dropout, is allowed (see the Inception-style sketch after the list). If everyone proposes a ResNet, and you come along and show that Inception-ResNet-v2 is better, you are providing insight into the problem: you are showing everyone how residual connections should be designed for sequence-to-expression tasks.
3. What we strictly prohibit is training a bunch of different models and averaging their predictions. You should not train a ResNet50, a ResNet101, etc., and then average their outputs at prediction time.
4. One of the reasons we did not want to specify the validation set is that you could design deep networks that easily overfit the training data; in that case you would need a bigger validation set to be confident about your best weights. But if you see that your model is not overfitting at all, you could get by with less validation data.
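As a concrete illustration of the k-mer/XGBoost path mentioned in point 1, here is a minimal sketch; the choice of k, the hyperparameters, and the toy data are all assumptions made for illustration:

```python
import itertools
import numpy as np
import xgboost as xgb

def kmer_counts(seq, k=4):
    """Vector of counts for every possible k-mer over the ACGT alphabet."""
    index = {"".join(p): i
             for i, p in enumerate(itertools.product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # windows containing N are skipped
            counts[index[kmer]] += 1
    return counts

# Toy stand-ins for the real training sequences and expression labels.
seqs = ["ATGCATGCATGCATGC", "GGGGTTTTAAAACCCC", "ACGTACGTACGTACGT"]
y = np.array([1.2, 0.4, 0.9])

X = np.stack([kmer_counts(s) for s in seqs])
model = xgb.XGBRegressor(n_estimators=200, max_depth=6)
model.fit(X, y)
```

And a sketch of the kind of end-to-end "ensemble-like" architecture that point 2 permits: parallel convolutional paths (Inception-style) over one-hot-encoded DNA, plus dropout, all trained as a single model. The framework (PyTorch) and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class InceptionBlock1D(nn.Module):
    """Parallel conv paths over one-hot DNA (4 input channels), concatenated."""
    def __init__(self, in_ch=4, out_ch=32):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (3, 7, 15)          # capture motifs of different widths
        ])
        self.dropout = nn.Dropout(0.2)   # dropout inside one model is allowed

    def forward(self, x):                # x: (batch, 4, seq_len)
        out = torch.cat([torch.relu(p(x)) for p in self.paths], dim=1)
        return self.dropout(out)

block = InceptionBlock1D()
x = torch.randn(8, 4, 110)               # a batch of 8 encoded sequences
print(block(x).shape)                     # torch.Size([8, 96, 110])
```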
Please let me know if anything is unclear.

1. If you can think of ways to augment the provided **training data**, that's okay.
2. You must not use the **test data** to introduce any form of augmentation. The model must not use the test data for any training purpose.
3. Pseudo-labeling the provided **test** data is not allowed. The model must not use the test data for any training purpose.
4. Averaging cross-fold models would count as ensembling. It doesn't provide any significant insight in terms of the novelty of model architecture.
5. Test-time augmentation is allowed (see the sketch below).
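For point 5, a hedged sketch of what test-time augmentation looks like in practice: one trained model, several augmented views of each test sequence, and the predictions averaged. The reverse-complement view and the toy `model` callable are illustrative assumptions, not part of the official rules:

```python
import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def predict_with_tta(model, seq):
    """Average a single model's predictions over augmented views of the input."""
    views = [seq, revcomp(seq)]           # assumed set of augmentations
    return float(np.mean([model(v) for v in views]))

# Toy stand-in for a trained model: scores a sequence by its GC content.
gc_model = lambda s: (s.count("G") + s.count("C")) / len(s)
print(predict_with_tta(gc_model, "ATGCGGCCAT"))
```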
>>> I can't remember the organizers saying data augmentation is prohibited.

Around minute 42 of the webinar...
I think the main aim of the organizers is to focus on model architecture and data encoding, using one model: data in, predictions out.
Under this philosophy, all of my points (and yours) should be forbidden.
However, it would be useful to state it clearly. For example:
The model should use X% of the provided data to train and Y% of the provided data to evaluate, encoded without augmentation; only one trained model can be used for the prediction.

I can't remember the organizers saying data augmentation is prohibited.
Yes, it would be really helpful, especially where the terminology is not clear. For example, the organizers allow the use of RF and boosting models, although these are ensembles in their very nature, while an RF-like ensemble of neural networks is prohibited. It is also well known that dropout IS a form of neural network ensembling, and it IS allowed; the organizers' baseline uses it.
Is it OK to use a teacher-student approach for noise reduction? And so on.

Actually, it would be helpful to have a short list of do's and don'ts on the rules page, such as:
Ensemble of models: No (already answered in the webinar/rules pages)
Data augmentation: No (already answered in the webinar/rules pages)
Pre-training: No (already answered in the webinar/rules pages)
Pseudo-labeling?
Cross-fold models and averaging cross-fold predictions?
Test-time augmentation?