Hi all, I want to kick off a thread that's always quite popular on Kaggle: sharing model performance. So far, I have been playing with LightGBM and NNs using a simple 70/30 train/test split. My LightGBM models reach Pearson correlation coefficients around 0.65 and my NNs around 0.7 (test evaluation). Also, I'm not able to overfit the training data, and I have some trouble with my NNs' gradients, which tend to vanish after a few epochs. Have you run into similar issues?

Created by Michele Tinti mtinti
Thanks!
btw in the following code snippet, I think you need to use `mode='max'` instead of `'auto'` (taken from the [training code](https://github.com/1edv/evolution/blob/master/manuscript_code/model/tpu_model/train_model.ipynb)):

```python
checkpoint = keras.callbacks.ModelCheckpoint(model_path, monitor='val_r_square', verbose=1, save_best_only=True, mode='auto')
# check 5 epochs
early_stop = keras.callbacks.EarlyStopping(monitor='val_r_square', patience=10, mode='auto')
```
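A minimal sketch of what the corrected callbacks could look like (`model_path` is just a placeholder for wherever you save checkpoints):

```python
from tensorflow import keras

model_path = 'best_model.h5'  # assumed checkpoint location

# mode='max' tells both callbacks that a larger val_r_square is an improvement;
# with mode='auto', Keras doesn't recognise the name 'val_r_square' and falls back to 'min'.
checkpoint = keras.callbacks.ModelCheckpoint(model_path, monitor='val_r_square',
                                             verbose=1, save_best_only=True, mode='max')
early_stop = keras.callbacks.EarlyStopping(monitor='val_r_square', patience=10, mode='max')
```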
Hello @mtinti , I used a fixed learning rate (as mentioned in the paper's GitHub repo) and stopped training when I hit the early stop. I remember changing some stuff since I used TensorFlow 2.8 (nothing major). If you are getting NaNs during training, something must have gone wrong in the model architecture code or in the way you are training it; I don't think the problem you are facing has anything to do with the amount of data. However, I did not train it on any TPU VM. I did my training on a GPU / multiple GPUs.
Hi @muntakimrafi, if you don't mind sharing, how was the training of the Vaishnav et al. 2022 model? In particular, did you use any early stopping or learning rate scheduler? Did you manage to train nicely to the end? I'm asking because I'm plagued by NaNs during training. I never hit my early stop; in practice, when I hit a NaN, I run the epoch again (and sometimes it goes well) or I decrease the learning rate and train a little more. I was curious to understand whether it is because of my model architecture or because of the huge amount of training data (or a mix of both :-) cheers Michele
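Not the setup used in the paper, but a common first line of defence against NaNs is gradient clipping plus callbacks that stop on a NaN loss and back off the learning rate. A hedged Keras sketch (all values are assumptions, not Vaishnav et al.'s settings):

```python
from tensorflow import keras

# Clip gradients so a single bad batch can't blow up the weights (clipnorm value is a guess).
optimizer = keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

callbacks = [
    keras.callbacks.TerminateOnNaN(),                      # stop cleanly if the loss goes NaN
    keras.callbacks.ReduceLROnPlateau(monitor='val_loss',  # back off the LR when progress stalls
                                      factor=0.5, patience=3, verbose=1),
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                  restore_best_weights=True),
]

# model.compile(optimizer=optimizer, loss='mse')
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=callbacks)
```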
@Szrack you are right. The thing I was referring to is from [here](https://github.com/1edv/evolution/blob/e99960569b65601293a174ca99a4466e968a0154/manuscript_code/model/tpu_model/rr_aux.py#L346). Because of the competition that can occur between the forward and the reverse strand, it is reasonable to provide the reverse-strand context to the network. But sending the reverse strand to the network as a separate input and learning a separate set of filters for it doesn't make much sense; why not learn the same set of filters? [Because the filters can be thought of as TF motifs, and the same TFs will interact with both the forward and reverse strands.] After you scan the forward and reverse strands with the motifs, you can perform operations on them where you treat them differently (it's not equivariant afterward). You would not want to create a completely strand-invariant model, since different parameters can be learned for the top and bottom strands.
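For context, a minimal Keras sketch of that idea (not the authors' code; the promoter length and filter sizes are placeholders): the reverse complement of a one-hot encoded sequence is obtained by flipping both the position axis and the A/C/G/T channel axis, and the same `Conv1D` "motif" filters scan both strands before the downstream layers are free to treat the two scans differently.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 110  # placeholder promoter length; channels ordered A, C, G, T

inputs = keras.Input(shape=(SEQ_LEN, 4))
# Reverse complement of a one-hot sequence: reverse the positions and swap A<->T, C<->G,
# which for the A,C,G,T channel ordering is just a flip of the channel axis as well.
rev_comp = layers.Lambda(lambda x: tf.reverse(x, axis=[1, 2]))(inputs)

motif_scanner = layers.Conv1D(128, 13, padding='same', activation='relu')  # one shared filter set
fwd_scan = motif_scanner(inputs)     # forward strand scanned with the "motif" filters
rev_scan = motif_scanner(rev_comp)   # reverse strand scanned with the *same* filters

# Downstream layers see both scans and can treat them differently
# (so the model is strand-aware, not fully strand-invariant).
x = layers.Concatenate()([fwd_scan, rev_scan])
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
```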
> You could also make the network invariant to forward or reverse strand

Thanks @muntakimrafi, interesting reading!

> if the promoter is transcribing in the reverse direction, it won't contribute positively to the transcription rate measurement.

Thanks @Szrack for pointing this out, I didn't think about the possibility of transcription starting on the reverse strand....
Just FYI: if the promoter is transcribing in the reverse direction, it won't contribute positively to the transcription rate measurement. If the RNAP initiation complex assembles and initiates on the reverse strand, the mRNA that is produced is not measured by the massively parallel reporter assay and the competition for binding could actually prevent assembly & initiation on the forward strand, leading to lower transcription rates. Hope that helps!
@mtinti You could also make the network invariant to the forward or reverse strand. There are several ways to do it. Vaishnav et al. used this strategy in their transformer model [not proposed by them; it had been proposed before]. Take a look at this Kundaje lab paper: https://proceedings.mlr.press/v165/zhou22a.html
Yes, it looks like it is not generalizing properly and just predicting the mean for a bunch of promoters (at least in my case). There is also another interesting issue. I was training a model that would see the promoter 50% of the time in the forward and 50% of the time in the reverse orientation. Then I would predict the test promoters in both the forward and the reverse orientation (a sort of test-time augmentation). You can see the results in panel C) of this figure: the predictions on the forward and reverse strands are similar, but the promoters predicted with just the mean values are different. https://raw.githubusercontent.com/mtinti/DREAMchallenge/main/plot_f_r.png Panels A) and B) are just scatter plots of the forward and reverse strand predictions against the target. Now I'm trying to think of an NN architecture that could exploit/mitigate this strand-dependent effect.
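For anyone wanting to try the same thing, a small sketch of this test-time augmentation (the `predict_fn` is a stand-in for whatever your trained model exposes; the dummy scorer is only there to make the example runnable):

```python
import numpy as np

COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')

def reverse_complement(seq: str) -> str:
    """Reverse-complement a promoter sequence string."""
    return seq.translate(COMPLEMENT)[::-1]

def tta_predict(predict_fn, seqs):
    """Average predictions on each promoter and its reverse complement."""
    fwd = np.asarray(predict_fn(seqs))
    rev = np.asarray(predict_fn([reverse_complement(s) for s in seqs]))
    return (fwd + rev) / 2.0

# Toy usage with a dummy scorer standing in for the trained model.
dummy_model = lambda seqs: [s.count('A') for s in seqs]
print(tta_predict(dummy_model, ['AAAACGT', 'ACGT']))  # e.g. [2.5 1. ]
```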
Hi @mtinti , I did not look into the details of the training data, but it could be that the training data is not distributed evenly, so our trained models may have a preference when predicting on the test data. Ideally, if the model generalized properly, we should not observe such horizontal lines. Do you think so? I think I am the only person who can view the plot... In my plot, I can also observe horizontal lines between the predicted data and the true data.
I don't know if I'm imagining things, but I think I can see it also in the host Transformer Neural Network predictions (slide 30 of the webinar)... By the way, for some reason, I can't see your image; it says: You lack READ access to the requested entity.
Hi @mtinti, Yes, a similar pattern can be observed in my predictions on HighQuality.pTpA.Glu.test.txt ${imageLink?synapseId=syn31298386&align=None&scale=100&responsive=true&altText=}
Hi @Szrack, if we train on a different scale of y, comparing the MSE is not going to help much... Pearson here is a good metric to compare between our models.
@FreakingPotato, does your prediction on HighQuality.pTpA.Glu.test.txt have this sort of strange density, such as the one highlighted by the horizontal line, around the prediction mean? https://github.com/mtinti/DREAMchallenge/blob/main/HighQuality.pTpA.Glu.test.png?raw=true I noticed this because my predictions on the competition test set seem to have it as well https://github.com/mtinti/DREAMchallenge/blob/main/pred_bins.png?raw=true
"close enough :-D did you use the standard scaler (x-mean / std)?" @mtinti Thanks, looks like my scaler is not standard enough :) "What about MAEs and MSEs? Would anyone like to share theirs? My loss fcn is MSE." Hi @Szrack, my root mean square error is 1.62 on train dataset(with 0.7,0.15,0.15 split)
> Not sure why they are not in exact square relationship

close enough :-D did you use the standard scaler ((x - mean) / std)?
What about MAEs and MSEs? Would anyone like to share theirs? My loss fcn is MSE.
Oh yes, I forgot they are not on the same scale... lol. However, after scaling, the correlation score is 0.937 (using np.corrcoef) and the R_square score is 0.837 (tfa.metrics.r_square.RSquare()). Not sure why they are not in an exact square relationship.
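A quick numpy sketch of why the two don't have to be in an exact square relationship: the squared Pearson coefficient only matches R^2 when the predictions already coincide with the best linear (slope/intercept) fit of the targets; any remaining scale or offset mismatch lowers R^2 but not Pearson (toy numbers, not the challenge data):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)
# Correlated but miscalibrated predictions: shrunken scale plus a constant offset.
y_pred = 0.5 * y_true + 0.2 + rng.normal(scale=0.2, size=1000)

r = np.corrcoef(y_true, y_pred)[0, 1]
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(r, r ** 2, r2)  # Pearson ~0.93, Pearson^2 ~0.86, R^2 noticeably lower (~0.67)
```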
Do you have preds and targets on the same scale? Made that mistake once...
Hi Michele, I quickly tested the dataset you mentioned to see my model's performance. Surprisingly, I got a Pearson correlation coefficient of 0.937 but an R_square score of 0.257. @mtinti Can you try to calculate the R_square score with your model on this new test dataset? Cheers, Ke
> Cannot wait for the release of the public leaderboard to see our model's performance over the test dataset.

@FreakingPotato I was desperate to see my model's performance as well, so I downloaded this file: 'HighQuality.pTpA.Glu.test.txt' from the organizer's GitHub account https://github.com/Carldeboer/CisRegModels. After predicting on this file, I get a Pearson correlation coefficient of 0.94 https://raw.githubusercontent.com/mtinti/DREAMchallenge/main/HighQuality.pTpA.Glu.test.png
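In case anyone wants to repeat this, a rough sketch of the evaluation. I'm assuming the file is tab-separated with the promoter sequence in the first column and the measured expression in the second, and `my_model_predict` is a hypothetical stand-in for whatever prediction function your model exposes; check both against your setup before running.

```python
import numpy as np
import pandas as pd

def evaluate_on_file(path, predict_fn):
    """Pearson CC of predict_fn on an (assumed) tab-separated sequence/expression file."""
    df = pd.read_csv(path, sep='\t', header=None, names=['sequence', 'expression'])
    preds = np.asarray(predict_fn(df['sequence'].tolist()), dtype=float)
    return np.corrcoef(df['expression'].to_numpy(dtype=float), preds)[0, 1]

# Usage (my_model_predict is hypothetical):
# print(evaluate_on_file('HighQuality.pTpA.Glu.test.txt', my_model_predict))
```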
Thanks for this clarification!
@mtinti I think a Pearson r of 0.7 makes sense; I was confused because you said you got 0.9. Yes, that is correct. I got it from a run where I just followed the steps proposed in Vaishnav et al. 2022. The way I think of it is that, if you add random noise to 20% of the labels of the MNIST digits dataset, your train and validation accuracy will be capped at 82% during training, but the model could still have 98-100% accuracy on the test dataset.
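The 82% figure follows from a little back-of-envelope arithmetic (assuming the corrupted labels are drawn uniformly at random over the 10 MNIST classes):

```python
# A model that always predicts the true digit agrees with a corrupted label only when
# the label was left intact (80%) or the random relabel happened to match (20% * 1/10).
clean_fraction, noise_fraction, n_classes = 0.8, 0.2, 10
apparent_accuracy_cap = clean_fraction + noise_fraction * (1.0 / n_classes)
print(apparent_accuracy_cap)  # 0.82 -> the ~82% cap on train/validation accuracy
```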
Hi @muntakimrafi, thanks for the clarifications.

> Are you saying you got an r^2 of 0.9 between the predictions of the app and your current model, the predictions of the app and the challenge training data, or something else?

I meant: 10000 random promoters from this challenge training data and the app predictions on the same 10000 random promoters (using the 'Expression' column) have a Pearson CC of 0.7. One final thing, correct me if I'm wrong... are you saying that the Transformer Neural Network has an r2 of 0.53 (evaluated in some kind of train/valid split on this challenge Train data) and a Pearson CC on this challenge Test data > 0.9 (r2 > 0.81)?
@muntakimrafi Thanks for your explanation!

> You should be able to get more than 0.5.

Yes, that explains why I could not see much improvement by tuning the model architecture only on the training dataset. Training on the whole dataset, the best validation r-square I got is 0.53.

> I think this will be clearer once we set up the public leaderboard where we will use a portion of the test data

Cannot wait for the release of the public leaderboard to see our model's performance over the test dataset.
The validation r_square was around 0.52/0.53 ish. One important thing to keep in mind is that the test data comes from a different experiment with substantially less noise than the training data. You should not be discouraged by a low score on the training data; you just need to make sure you are not overfitting to the noise in the data and are learning generalized rules. I think this will be clearer once we set up the public leaderboard, where we will use a portion of the test data (expression measured in a different experiment with substantially less noise).

@mtinti
> To see how the organizer's model performed on our train data, I got predictions of 10000 random promoters using the web implementation available here (https://evolution-app-vbxxkl6a7a-ue.a.run.app/) and it's nowhere near 0.9.

Are you saying you got an r^2 of 0.9 between the predictions of the app and your current model, the predictions of the app and the challenge training data, or something else? The first would make sense, but the second would be very surprising, since there is substantial noise in the training data provided for this challenge.

@FreakingPotato
> Yes, I also tested the organizer's model on the competition data (subset: 200,000) with R^2 around 0.4.

You should be able to get more than 0.5.
Wow, ok, thanks for letting us know. If you don't mind sharing, can you tell us something about the validation score of the model (Transformer Neural Network) during training?
Hello everyone, amazing discussion going on here. Feels like Kaggle! By the way, the results that were shared during the presentation were performances on the test data of this competition. The models were trained from scratch on the training data everybody is using.
I hope the organiser will re-train their model on this dataset to be fairer to us :-D
Yes, I also tested the organizer's model on the competition data (subset: 200,000) and got an R^2 of around 0.4. I think we won't be able to reach 0.9 Pearson easily, since the competition data is noisier.
One thing that puzzles me is the performance of the organiser's model (Transformer Neural Network) on the test dataset, above 0.9 Pearson, as shown in slide 30 of the presentation (https://drive.google.com/file/d/150eHgy-x3R9ZMBENoDT6zfq4KLgKcmfn/view?usp=sharing). To see how the organizer's model performed on our train data, I got predictions of 10000 random promoters using the web implementation available here (https://evolution-app-vbxxkl6a7a-ue.a.run.app/), and it's nowhere near 0.9. I suppose this is because of different data collection strategies and/or growth conditions of the yeast strains. So I'm inclined now to believe that, using this dataset, we will never get to the level of the organizer's model. I'd love to be proved wrong anyway....
Hi Michele, Thank you for your reply. I agree that the square of the Pearson coefficient is equal to the coefficient of determination in this case. We are getting similar performance at the moment. I am using one-hot encoding because I think different CNN filter sizes will act as a k-mer encoding. I am currently testing multiple transformer blocks but did not see much improvement. Maybe more work should be done on data augmentation. Cheers, Ke
Hi Ke, Thanks for the feedback! So we can say that we have approximately similar performances (the square root of an R2 of 0.5 ≈ 0.71 Pearson CC); what do you think? I'm just going with Pearson as it will be the main evaluation metric, as far as I have understood. I'm around these values (0.7s) independently of the NN architecture (LSTM/GRU or LSTM/GRU + attention) and encoding (embedding on ACGTN or one-hot). Just yesterday, I tried a k-mer encoding approach with k-mers of length 4, 5, 6 and 7. I can overfit my training data with k-mers of length 7: the train Pearson CC goes to 0.86, but the validation is very bad (0.04). As soon as I introduce some dropout regularisation, I'm back to 0.7 Pearson CC in train and test. Cheers Michele
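For reference, one simple way to build the k-mer inputs (just a sketch of overlapping, stride-1 windows; whether you then map them through an embedding layer or count them is up to the architecture):

```python
def kmer_tokens(seq: str, k: int = 7) -> list[str]:
    """Overlapping k-mers of a promoter sequence (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokens('ACGTACGTA', k=4))
# ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT', 'CGTA']
```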
Hi mtinti, It is a good idea to share the models' performance, so we can discuss how to improve it. I don't think the Pearson correlation coefficient is the right metric to evaluate a regression model's performance; maybe R^2 (coefficient of determination) will better describe your model's performance in this case. Currently, I can achieve an R^2 slightly above 0.5 with a 70/15/15 data split and a naive CNN. Cheers, Ke
