Error when testing on 100k dataset #18

Open
mahshad92 opened this issue Jan 28, 2019 · 7 comments

mahshad92 commented Jan 28, 2019

Hi,

I am trying to replicate your results. Although I have no issues loading the trained model and testing it on the toy test samples (100), when I try to use the same model to get the accuracy on all 10,355 test samples in the 100K dataset, the test accuracy becomes NaN after some time and I get an error after 2000 samples. I do not understand this behavior. I changed the token length to get rid of warnings, but that did not help. Please let me know if you have faced the same issue.
log.txt

[01/27/19 17:43:52] 1.046239
[01/27/19 17:43:52] Number of samples 2000 - Accuracy = nan
[01/27/19 17:43:54] 1.082996
[01/27/19 17:43:58] 1.228099
[01/27/19 17:44:00] 1.140648
[01/27/19 17:44:03] 1.131666
[01/27/19 17:44:06] 1.043551
[01/27/19 17:44:09] 1.162436
[01/27/19 17:44:11] 1.087319
[01/27/19 17:44:14] 1.575318
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu line=163 error=59 : device-side assert triggered
/home/mxm7832/torch/install/bin/luajit: /home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (59) : device-side assert triggered at /tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:163
stack traceback:
[C]: in function 'v'
/home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'Sigmoid_updateOutput'
/home/mxm7832/torch/install/share/lua/5.1/nn/Sigmoid.lua:4: in function 'func'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
src/model/model.lua:360: in function 'feval'
src/model/model.lua:885: in function 'step'
src/train.lua:111: in function 'train'
src/train.lua:289: in function 'main'
src/train.lua:295: in main chunk
[C]: in function 'dofile'
...7832/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk

da03 (Collaborator) commented Jan 28, 2019

Hmm, weird. Can you figure out which minibatch/image triggered the error, e.g., by printing image names and using trial and error?
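
For example, something along these lines could go into the test loop (a rough sketch only; batchIdx and batch.img_paths are assumed names, not the actual fields of this repo's data loader):

-- Hypothetical sketch: log the image file names of every minibatch right before
-- the forward pass, so the offending batch is visible in the log once the
-- asynchronous CUDA assert finally surfaces. batchIdx and batch.img_paths are
-- placeholder names for illustration only.
print(string.format('[batch %d]', batchIdx))
for i, name in ipairs(batch.img_paths) do
  print(string.format('  %d : %s', i, name))
end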

mahshad92 (Author) commented Jan 29, 2019

@da03 So I printed the image names in each batch; it looks like "75dc30bf82.png" is an empty image and causes acc=NaN.

Also, from the log file I attached you can see the batch that causes the error (I checked the token sizes before it crashes). I also ran with CUDA_LAUNCH_BLOCKING=1 to get a better description of the error:

{
1 : "2b80174519.png"
2 : "5712d3adfe.png"
3 : "1aad846709.png"
4 : "1380c58267.png" ---> token length 391
5 : "7d032bac62.png" ---> token length 249
}
/tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/THCTensorIndex.cu:275: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [3,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.

I checked the two samples that might be causing this, but I'm still confused.

nan_log.txt
mainError_log.txt

mahshad92 reopened this Jan 30, 2019
da03 (Collaborator) commented Jan 30, 2019

Hmm, acc is not a big issue here, since we only used validation perplexity to select models and then use another evaluation script to calculate image accuracy. For blank images I suspect there's a zero-divided-by-zero issue, which causes the NaNs.
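
As a rough illustration of the kind of guard that avoids the 0/0, assuming running counters named numCorrect and numTotal (the actual variable names in src/ may differ):

-- Hedged sketch: skip the 0/0 = nan case when a batch contributes no valid
-- target tokens (e.g., a blank image). numCorrect, numTotal and numSamples
-- are assumed names for illustration only.
local acc = 0
if numTotal > 0 then
  acc = numCorrect / numTotal
end
print(string.format('Number of samples %d - Accuracy = %f', numSamples, acc))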

For the more serious error, it looks like a problem with a LookupTable. There are two places where we use a LookupTable: the positional embedding and the decoder word embedding. Can you check the code to pinpoint whether the issue occurred during the encoder forward pass or the decoder forward pass?
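
One quick way to pinpoint it is to range-check the indices on the CPU side right before each LookupTable forward, since on CUDA an out-of-range index only shows up later as "device-side assert triggered". A rough sketch, with assumed names (targets, vocabSize, posIndices, maxPos) rather than the repo's actual ones:

-- Hedged sketch: nn.LookupTable asserts when an index falls outside 1..nIndex.
-- Check both suspects explicitly so the log says which lookup is at fault.
-- targets, vocabSize, posIndices and maxPos are placeholder names.
local function checkRange(ids, maxIdx, label)
  local lo, hi = ids:min(), ids:max()
  if lo < 1 or hi > maxIdx then
    print(string.format('%s index out of range: min=%d max=%d allowed=1..%d',
      label, lo, hi, maxIdx))
  end
end
checkRange(targets, vocabSize, 'decoder word embedding')  -- decoder forward pass
checkRange(posIndices, maxPos, 'positional embedding')    -- encoder forward pass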

da03 (Collaborator) commented Jan 30, 2019

By the way, how did you get the test dataset? It's weird, since I've fully tested this code on it. Here is my processed dataset: http://lstm.seas.harvard.edu/latex/data/ (the processed section).

mahshad92 (Author) commented

@da03 I downloaded the data from the link provided in the repo: https://zenodo.org/record/56198#.XFhyiXXwYph

Thanks for providing the processed data. I get the same issue on some of the inputs from that set as well (same pattern). I should be able to finish after some cleaning.


pouyan-sh commented Oct 7, 2021

Hi,

@mahshad92, did you find a way to overcome this issue? I faced the same problem.
@da03, can you help me with this issue? It's a bit weird, because you said you had tested the code with the dataset, yet it raises an error for me. (At first I get some NaNs for acc, and after a few more steps it raises an error like the one @mahshad92 mentioned above.)

pouyan-sh commented

I resolved the problem by retraining the model on my own device.
