Deep Learning model too large #15516
hasithjp asked this question in Technical Notes
Problem
H2O Deep Learning can trigger an internal limitation of H2O: the maximum size of a single object in the distributed key-value (K-V) store that forms the core of H2O is 256 MB. Once the Deep Learning model reaches that size, this error occurs. The cause is that the Deep Learning model is currently stored as one large object instead of being split into partial pieces. Cutting it into one piece per hidden layer would not solve the issue either; a single weight matrix would have to be cut into multiple pieces, which is cumbersome to implement. That said, a model of that size would also take a very long time to train.
Note: This limit has nothing to do with the number of rows of the training data (only the number of columns matters, since that determines the size of the first hidden layer's weight matrix), nor with the available RAM or the maximum allowed heap memory (which is checked separately). It also has nothing to do with the number of nodes, threads, etc. It is purely a function of the model complexity; see the next section.
What affects the model size?
It is mainly the total number of weights and biases, multiplied by an overhead factor of x1, x2, or x3: x1 when momentum_start == 0 && momentum_stable == 0 and no adaptive learning rate is used, x2 when momentum > 0, and x3 when the adaptive learning rate (ADADELTA) is used. On top of that, there is some small overhead for model metrics, statistics, counters, etc.
The total number of weights is determined directly by the fully connected layers:
The number of input columns (after automatic one-hot encoding of categoricals)
The size of the hidden layers
The number of output neurons (#classes)
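As a rough sketch, the parameter count and resulting in-memory size can be estimated from these three quantities (hypothetical layer sizes; assuming 4-byte floats and the x1/x2/x3 overhead factors described above):

```python
# Sketch: estimate the stored size of a fully connected H2O DL model.
# Layer sizes here are hypothetical; overhead factors follow the rule above.

def count_parameters(layer_sizes):
    """Total weights + biases for fully connected layers.

    layer_sizes: [input_cols, hidden_1, ..., hidden_n, output_neurons]
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per non-input neuron
    return weights + biases

def estimate_model_bytes(layer_sizes, overhead_factor):
    # overhead_factor: 1 (no momentum), 2 (momentum > 0), 3 (adaptive rate)
    # Assumes weights are stored as 4-byte floats.
    return count_parameters(layer_sizes) * overhead_factor * 4

# Example: 5000 input columns (after one-hot encoding), one hidden
# layer of 5000 neurons, 10 output classes -> ~25M parameters.
params = count_parameters([5000, 5000, 10])
```

With ADADELTA (x3) such a model exceeds the 256 MB limit; with plain SGD and no momentum (x1) it fits.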
Failing example (~25M floats * 3 for ADADELTA > 256MB)
Working example (~25M floats * 1 without ADADELTA and no momentum < 256MB)
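The arithmetic behind these two cases can be checked directly (a sketch assuming 4-byte floats):

```python
# ~25M floats, compared against the 256 MB K-V store value limit.
LIMIT = 256 * 1024 * 1024

floats = 25_000_000
adadelta_bytes = floats * 3 * 4  # x3 overhead for ADADELTA -> ~300 MB
plain_bytes    = floats * 1 * 4  # x1 without ADADELTA/momentum -> ~100 MB

assert adadelta_bytes > LIMIT  # fails with "Model is too large"
assert plain_bytes < LIMIT     # trains without hitting the limit
```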
Output:
java.lang.IllegalArgumentException: Model is too large
For more information visit:
http://jira.h2o.ai/browse/TN-5
at hex.deeplearning.DeepLearningModel.&lt;init&gt;(DeepLearningModel.java:424)
at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:201)
at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:171)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1005)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
barrier onExCompletion for hex.deeplearning.DeepLearning$DeepLearningDriver@5205f0fd
Solution
The current workaround is to reduce the number of hidden neurons, or to reduce the number of input features (especially categorical ones, since one-hot encoding multiplies the column count).
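To see why this works, note that the dominant weight matrix scales with the product of adjacent layer sizes, so shrinking the hidden layer shrinks the model quickly (illustrative, hypothetical sizes; 4-byte floats assumed):

```python
# Sketch of the effect of halving a hidden layer on the largest matrix.
LIMIT = 256 * 1024 * 1024

cols = 5000                # input columns after one-hot encoding
too_big = cols * 5000      # hidden=5000 -> 25M weights in one matrix
fits    = cols * 2500      # hidden=2500 -> 12.5M weights

# Even with the x3 ADADELTA overhead, the smaller model stays under the limit:
assert too_big * 3 * 4 > LIMIT
assert fits * 3 * 4 < LIMIT
```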
JIRA Issue Migration Info
Jira Issue: TN-5
Assignee: Arno Candel
Reporter: Arno Candel
State: Closed
Relates to: #13925