I've been running into issues with NaNs during training and was wondering if anyone has had similar problems, or if there could be something wrong with the machine my jobs run on. I've tried a lot of different things, including changing learning rates, different regularizers, gradient clipping, etc., and it might run for a little while, but it eventually always turns to NaNs. If I run the same Docker image locally (on a TitanX or 980-Ti), I never run into the issue. It can't be an issue with the data either, because it still happens if I run with random inputs. I'm at a loss for what could be causing it. I'm not sure if it's possible, but could there be an issue with the particular GPU on the machine my stuff runs on? If it helps, I'm using Theano with Keras. Thanks in advance!
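For concreteness, here is a minimal sketch of the kind of setup I'm describing (gradient clipping on the optimizer plus a callback that halts on a NaN loss), assuming a Keras 2-style API on the Theano backend. The StopOnNaN callback and the toy model are illustrative only, not my actual code:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.callbacks import Callback

# Illustrative callback: stop training as soon as the loss becomes NaN,
# which helps pin down the batch/epoch where the divergence starts.
class StopOnNaN(Callback):
    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and np.isnan(loss):
            print('NaN loss at batch %d; stopping.' % batch)
            self.model.stop_training = True

# Toy model; the architecture is just a placeholder.
model = Sequential([Dense(64, activation='relu', input_dim=100),
                    Dense(1)])

# clipnorm caps the gradient norm, one of the mitigations mentioned above.
model.compile(optimizer=Adam(lr=1e-4, clipnorm=1.0), loss='mse')

# Random inputs reproduce the data-independent test described in the post.
x = np.random.rand(1024, 100).astype('float32')
y = np.random.rand(1024, 1).astype('float32')
model.fit(x, y, epochs=5, callbacks=[StopOnNaN()])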
Hi Thomas,
I was finally able to get it working today. I still have no idea what was happening, but I changed a bunch of versions around and it looks like it's working at the moment.
Thanks,
Bill

Dear Bill,
Apologies for the delay in response. Has your issue been resolved?
Best,
Thomas