Error when testing on 100k dataset #18

Open
mahshad92 opened this issue Jan 28, 2019 · 7 comments

mahshad92 commented Jan 28, 2019

Hi,

I am trying to replicate your results. Although I have no issues loading the trained model and testing it on the toy test samples (100), when I try to use the same model to get the accuracy on all 10,355 test samples in the 100K dataset, the test accuracy becomes NaN after some time and I get an error after 2000 samples. I do not understand this behavior. I changed the token length to get rid of warnings, but that did not help. Please let me know if you have faced the same issue.
log.txt

[01/27/19 17:43:52] 1.046239
[01/27/19 17:43:52] Number of samples 2000 - Accuracy = nan
[01/27/19 17:43:54] 1.082996
[01/27/19 17:43:58] 1.228099
[01/27/19 17:44:00] 1.140648
[01/27/19 17:44:03] 1.131666
[01/27/19 17:44:06] 1.043551
[01/27/19 17:44:09] 1.162436
[01/27/19 17:44:11] 1.087319
[01/27/19 17:44:14] 1.575318
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu line=163 error=59 : device-side assert triggered
/home/mxm7832/torch/install/bin/luajit: /home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (59) : device-side assert triggered at /tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:163
stack traceback:
[C]: in function 'v'
/home/mxm7832/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'Sigmoid_updateOutput'
/home/mxm7832/torch/install/share/lua/5.1/nn/Sigmoid.lua:4: in function 'func'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
.../mxm7832/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
src/model/model.lua:360: in function 'feval'
src/model/model.lua:885: in function 'step'
src/train.lua:111: in function 'train'
src/train.lua:289: in function 'main'
src/train.lua:295: in main chunk
[C]: in function 'dofile'
...7832/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk

da03 (Collaborator) commented Jan 28, 2019

Hmm, weird. Can you figure out which minibatch/image triggered the error, e.g., by printing image names and using trial and error?
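
For example, something along these lines could go into the test loop (a rough sketch only; batchIdx and batch.img_paths are assumed names, not the actual fields of this repo's data loader):

-- Hypothetical sketch: log the image file names of every minibatch right before
-- the forward pass, so the offending batch is visible in the log once the
-- asynchronous CUDA assert finally surfaces. batchIdx and batch.img_paths are
-- placeholder names for illustration only.
print(string.format('[batch %d]', batchIdx))
for i, name in ipairs(batch.img_paths) do
  print(string.format('  %d : %s', i, name))
end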

mahshad92 (Author) commented Jan 29, 2019

@da03 So I printed the image names in each batch; it looks like "75dc30bf82.png" is an empty image and causes acc=NaN.

Also, from the log file I attached you can see the batch that causes the error (I checked the token sizes before it crashes). I also ran with CUDA_LAUNCH_BLOCKING=1 to get a better description of the error:

{
1 : "2b80174519.png"
2 : "5712d3adfe.png"
3 : "1aad846709.png"
4 : "1380c58267.png" ---> token length 391
5 : "7d032bac62.png" ---> token length 249
}
/tmp/luarocks_cutorch-scm-1-2331/cutorch/lib/THC/THCTensorIndex.cu:275: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [3,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.

I checked the two samples that might be causing this, but I'm still confused.

nan_log.txt
mainError_log.txt

mahshad92 reopened this Jan 30, 2019
da03 (Collaborator) commented Jan 30, 2019

Hmm, acc is not a big issue here, since we only used validation perplexity to select models and then use another evaluation script to calculate image accuracy. For blank images I suspect there's a zero-divided-by-zero issue, which causes the NaNs.
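
As a rough illustration of the kind of guard that avoids the 0/0, assuming running counters named numCorrect and numTotal (the actual variable names in src/ may differ):

-- Hedged sketch: skip the 0/0 = nan case when a batch contributes no valid
-- target tokens (e.g., a blank image). numCorrect, numTotal and numSamples
-- are assumed names for illustration only.
local acc = 0
if numTotal > 0 then
  acc = numCorrect / numTotal
end
print(string.format('Number of samples %d - Accuracy = %f', numSamples, acc))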

For the more serious error, it looks like a problem with a LookupTable. There are two places where we use a LookupTable: the positional embedding and the decoder word embedding. Can you check the code to pinpoint whether the issue occurred during the encoder forward pass or the decoder forward pass?
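
One quick way to pinpoint it is to range-check the indices on the CPU side right before each LookupTable forward, since on CUDA an out-of-range index only shows up later as "device-side assert triggered". A rough sketch, with assumed names (targets, vocabSize, posIndices, maxPos) rather than the repo's actual ones:

-- Hedged sketch: nn.LookupTable asserts when an index falls outside 1..nIndex.
-- Check both suspects explicitly so the log says which lookup is at fault.
-- targets, vocabSize, posIndices and maxPos are placeholder names.
local function checkRange(ids, maxIdx, label)
  local lo, hi = ids:min(), ids:max()
  if lo < 1 or hi > maxIdx then
    print(string.format('%s index out of range: min=%d max=%d allowed=1..%d',
      label, lo, hi, maxIdx))
  end
end
checkRange(targets, vocabSize, 'decoder word embedding')  -- decoder forward pass
checkRange(posIndices, maxPos, 'positional embedding')    -- encoder forward pass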

da03 (Collaborator) commented Jan 30, 2019

By the way, how did you get the test dataset? It's weird, since I've fully tested this code on it. Here is my processed dataset: http://lstm.seas.harvard.edu/latex/data/ (the processed section).

mahshad92 (Author) commented

@da03 I downloaded the data from the link provided in the repo: https://zenodo.org/record/56198#.XFhyiXXwYph

Thanks for providing the processed data. I get the same issue on some of the inputs from that set as well (same pattern). I should be able to finish after some cleaning.


pouyan-sh commented Oct 7, 2021

Hi,

@mahshad92, did you find a way to overcome this issue? I faced the same problem.
@da03, can you help me with this issue? It's a bit weird, because you said you had tested the code with the dataset, yet it raises an error for me. (At first I get some NaNs for acc, and after a few more steps it raises an error like the one @mahshad92 mentioned above.)

pouyan-sh commented

I resolved the problem by retraining the model on my own device.
