Error when testing on 100k dataset #18
Comments
Hmm weird, can you figure out which minibatch/image triggered the error? E.g., by printing image names and trial and error.
@da03 So I printed image names in each batch; it looks like "75dc30bf82.png" is an empty image and causes acc = nan. Also, from the log file I attached you can see the batch that causes the error (I checked the token size before it crashes). I also ran with CUDA_LAUNCH_BLOCKING=1 to get a better description of the error. I checked the two samples that might cause this, still confused.
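For reference, here is a minimal sketch (not code from this repo; the image directory and the "<image_name> <label_index>" list format are assumptions based on the processed dataset layout) for scanning the test list for blank images before evaluation:

-- Flag images with (almost) no contrast; a blank page contributes no target
-- tokens and can make the running accuracy become nan downstream.
require 'image'

local image_dir = 'data/images/'          -- hypothetical path
local list_file = 'data/test_filter.lst'  -- hypothetical path

for line in io.lines(list_file) do
  local name = line:match('^(%S+)')
  local img = image.load(image_dir .. name, 1)  -- load as single-channel
  if img:max() - img:min() < 1e-6 then          -- near-constant image => likely blank
    print('blank image: ' .. name)
  end
end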
Hmm, acc is not a big issue here since we only use val ppl to select models and use a separate evaluation script afterwards to calculate image accuracy. For blank images I suspect there's a zero-divided-by-zero issue which caused the nan's. For the more serious error, it looks like a problem with a LookupTable. There are two places where we use a LookupTable: the positional embedding and the decoder word embedding. Can you check the code to pinpoint whether the issue occurred during the encoder forward pass or during the decoder forward pass?
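One quick check on the decoder side (a sketch with assumed file names and formats, not code from this repo) is to verify that every token in the label file maps into the vocabulary, since an out-of-range index into an nn.LookupTable on CUDA surfaces as exactly this kind of device-side assert:

-- Assumed formats: one token per line in the vocabulary file, and one
-- whitespace-tokenized formula per line in the label file.
local vocab_file   = 'data/latex_vocab.txt'   -- hypothetical path
local formula_file = 'data/formulas.norm.lst' -- hypothetical path

-- Build token -> index map; anything missing here would map outside [1, vocab_size].
local vocab, vocab_size = {}, 0
for token in io.lines(vocab_file) do
  vocab_size = vocab_size + 1
  vocab[token] = vocab_size
end

local line_no = 0
for formula in io.lines(formula_file) do
  line_no = line_no + 1
  for token in formula:gmatch('%S+') do
    if vocab[token] == nil then
      print(string.format('line %d: token %q not in the vocabulary (size %d)',
                          line_no, token, vocab_size))
    end
  end
end

If the labels are clean, the remaining suspect would be the positional embedding, e.g. an input image larger than anything the positional table was sized for during training.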
BTW, how did you get the test dataset? It's weird since I've fully tested this code. Here is my processed dataset: http://lstm.seas.harvard.edu/latex/data/ (the processed section).
@da03 I downloaded the data from the link provided in the repo: https://zenodo.org/record/56198#.XFhyiXXwYph. Thanks for providing the processed data; I get the same issue on some of the inputs from that set as well (same pattern). I should be able to finish after some cleaning.
Hi, @mahshad92, did you find a way to overcome this issue? I faced the same problem. |
I resolved the problem by retraining the model on my own device.
Hi,
I am trying to replicate your results. Although I have no issues loading the trained model and testing it on the toy test samples (100), when I use the same model to compute accuracy on all test samples in the 100k dataset (10,355), the test accuracy becomes nan after some time and I get an error after 2,000 samples. I do not understand this behavior. I changed the token length to get rid of the warnings, but that did not help. Please let me know if you have faced the same issue.
log.txt
[01/27/19 17:43:52] 1.046239
[01/27/19 17:43:52] Number of samples 2000 - Accuracy = nan
[01/27/19 17:43:54] 1.082996
[01/27/19 17:43:58] 1.228099
[01/27/19 17:44:00] 1.140648
[01/27/19 17:44:03] 1.131666
[01/27/19 17:44:06] 1.043551
[01/27/19 17:44:09] 1.162436
[01/27/19 17:44:11] 1.087319
[01/27/19 17:44:14] 1.575318
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu line=163 error=59 : device-side assert triggered
/home/mxm7832/torch/install/bin/luajit: /home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (59) : device-side assert triggered at /tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:163
stack traceback:
[C]: in function 'v'
/home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'Sigmoid_updateOutput'
/home/mxm7832/torch/install/share/lua/5.1/nn/Sigmoid.lua:4: in function 'func'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
src/model/model.lua:360: in function 'feval'
src/model/model.lua:885: in function 'step'
src/train.lua:111: in function 'train'
src/train.lua:289: in function 'main'
src/train.lua:295: in main chunk
[C]: in function 'dofile'
...7832/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
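Separately, for the "Accuracy = nan" lines in the log: following the zero-divided-by-zero suspicion mentioned earlier in the thread, the nan can be avoided with a guarded running-accuracy computation along these lines (a sketch, not the repo's evaluation code):

-- Counts are accumulated across batches and the ratio is only printed once at
-- least one target token has been seen, so a blank image (zero targets) never
-- leads to a 0/0 division.
local correct_total, target_total = 0, 0

local function accumulate(batch_correct, batch_targets)
  correct_total = correct_total + batch_correct
  target_total  = target_total + batch_targets
end

local function report(num_samples)
  if target_total > 0 then
    print(string.format('Number of samples %d - Accuracy = %f',
                        num_samples, correct_total / target_total))
  else
    print(string.format('Number of samples %d - Accuracy undefined (no targets yet)',
                        num_samples))
  end
end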