Poor performance with big dilation rates (was "Performance on Branching Models" before) #132

Open
kmader opened this issue Mar 27, 2019 · 16 comments

kmader commented Mar 27, 2019

Performance on a fairly simple 1D model with stacked branches / bottlenecks / inception blocks was surprisingly poor (>300x slower).

Keras on CPU (TensorFlow Lite for the same model takes 170 µs)

Using the 1 CPU/Thread trick

import numpy as np
import tensorflow as tf
import keras
from keras.models import model_from_json

# restrict TensorFlow to a single thread for a fair CPU comparison
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
keras.backend.set_session(sess)
model = model_from_json(json_blob)  # json_blob: the model's JSON string

%%time
model.predict(np.zeros((1, 600, 3)));
CPU times: user 88.1 ms, sys: 35.1 ms, total: 123 ms
Wall time: 25.1 ms

Frugally Deep App

I realize the timing here includes loading and parsing the model, but that is a small fraction of the compute time (as can be seen by changing the length of the sequence).

!time ../build/main
Loading model
Loading json ... done. elapsed time: 0.192233 s
Running test 1 of 1 ... done. elapsed time: 1.472293 s
Loading, constructing, testing of ../model.json took 3.309670 s overall.
Running all zeros: 

real	0m31.150s
user	0m30.712s
sys	0m0.245s

Model Details

Model as Keras H5, FrugallyDeep JSON, and C++ file for inference: https://gist.github.com/kmader/135db41c5ea35c0dc8cae95ed90087f4

[out_model: model architecture image]

kmader commented Mar 27, 2019

My suspicion would be that the model is being evaluated from finish to start without caching any of the intermediate results, recomputing every value every time (which would be roughly 3*8*8*8 times as much work), but I haven't had time to dive into the code and figure out whether that is really the case.

Dobiasd commented Mar 27, 2019

My suspicion would be that the model is being evaluated from finish to start without caching any of the intermediate results, recomputing every value every time (which would be roughly 3*8*8*8 times as much work)

Good thinking, but that is not the case. Frugally-deep does cache the output of each node in the computational graph. :)
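
For illustration only (this is not frugally-deep's actual layer/node code; the struct and names below are made up), the idea is plain memoization: the output of each node is stored the first time it is computed and reused by every later consumer, so shared branches are evaluated only once per forward pass:

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical graph node, just to illustrate output caching.
struct node
{
    std::string name;
    std::vector<const node*> inputs;

    double evaluate(std::map<std::string, double>& cache) const
    {
        const auto it = cache.find(name);
        if (it != cache.end())
            return it->second; // output already computed, reuse it
        double result = 1.0;   // stand-in for the real layer computation
        for (const auto* input : inputs)
            result += input->evaluate(cache);
        cache[name] = result;  // store for later consumers of this node
        return result;
    }
};

int main()
{
    const node a{"a", {}};
    const node b{"b", {&a}};
    const node c{"c", {&a}};
    const node out{"out", {&b, &c}};
    std::map<std::string, double> cache;
    // "a" is evaluated only once, even though both "b" and "c" depend on it.
    std::cout << out.evaluate(cache) << std::endl;
}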

I just ran your code on my machine, and the output is as follows:

Loading model
Loading json ... done. elapsed time: 0.100226 s
Running test 1 of 1 ... done. elapsed time: 0.034413 s
Loading, constructing, testing of ./model.json took 0.250775 s overall.
Running all zeros: 

real    0m0.863s
user    0m0.742s
sys     0m0.116s

The first forward pass (the test one) took 0.034413 s.

So I would guess the reason for the bad performance you see is much simpler, e.g.:

  • You did not tell your compiler to produce optimized machine code.
  • You are using hardware from the 90s.
  • You are mining too much bitcoin in the background.

OK, jokes aside, it's very likely the first of the three above. Please check this part of the FAQ for more details: https://github.com/Dobiasd/frugally-deep/blob/master/FAQ.md#why-is-my-prediction-roughly-100-times-slower-in-c-as-in-python
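
For reference (assuming g++ or clang++, and that the include paths for frugally-deep and its dependencies are already set up), an optimized build looks something like

g++ -std=c++14 -O3 main.cpp -o main

whereas a debug build with just -g leaves all the template-heavy code unoptimized.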

kmader commented Mar 27, 2019

Wow, thanks, I have never seen that crazy a speed-up between -g and -O3. Issue resolved!

Loading model
Loading json ... done. elapsed time: 0.064939 s
Running test 1 of 1 ... done. elapsed time: 0.037276 s
Loading, constructing, testing of ../model.json took 0.270976 s overall.
Running all zeros: 

real	0m0.990s
user	0m0.797s
sys	0m0.186s

kmader closed this as completed Mar 27, 2019

kmader commented Mar 27, 2019

Running it on a tensor of shape (1, 1, 18000, 3), it seems to scale very differently (I guess the Keras/TF runtime is mostly overhead, while in frugally-deep it is mostly compute), but the gap is now a factor of 60:

Loading model
Loading json ... done. elapsed time: 0.059423 s
Running test 1 of 1 ... done. elapsed time: 0.037828 s
Loading, constructing, testing of ../model.json took 0.263522 s overall.
Running all zeros: 

real	0m29.779s
user	0m20.428s
sys	0m7.359s

Keras Evaluation

%%time
_ = model.predict(np.random.uniform(size=(1, 3600, 3)))
CPU times: user 493 ms, sys: 211 ms, total: 705 ms
Wall time: 143 ms

Dobiasd commented Mar 27, 2019

Sorry, I'm not sure I understand correctly.

So when using larger input vectors, you again have a performance problem with frugally-deep, or not?

Dobiasd commented Mar 27, 2019

Wow, thanks, I have never seen that crazy a speed-up between -g and -O3. Issue resolved!

Yeah, frugally-deep, and especially FunctionalPlus, which it uses, rely heavily on compiler optimizations. I tried to write the code to be as readable as possible for humans. For this I avoided manual optimizations, but always checked (with profiling) whether the compilers are able to do them for us. :)

kmader commented Mar 27, 2019

Well, to summarize the (very poorly conducted) benchmarks above:

Vector size    Keras (s)    Frugally-deep (s)
600            0.088        0.990
18000          0.493        29.779

As the data gets larger by a factor of 30, Keras takes about 5x longer while frugally-deep takes about 30x longer. So there must be something else at play in TensorFlow that lets it handle larger inputs better.

Dobiasd commented Mar 27, 2019

Huh, that is interesting.

I tested your model with the following code:

#include "fdeep/fdeep.hpp"
int main()
{
    const std::size_t test_runs = 3;
    const auto model = fdeep::load_model("model.json", true);
    fdeep::tensor5s inputs = {fdeep::tensor5(fdeep::shape5(1, 1, 1, 18000, 3), 0)};
    fplus::stopwatch stopwatch;
    for (std::size_t i = 0; i < test_runs; ++i)
    {
        std::cout << "Starting test run " << i << "." << std::endl;
        model.predict(inputs);
    }
    const double duration_avg = stopwatch.elapsed() / static_cast<double>(test_runs);
    std::cout << "Forward pass took " << duration_avg << " s on average." << std::endl;
}

and had this output:

Loading json ... done. elapsed time: 0.098866 s
Running test 1 of 1 ... done. elapsed time: 0.034203 s
Loading, constructing, testing of model.json took 0.254174 s overall.
Starting test run 0
Starting test run 1
Starting test run 2
Forward pass took 19.488 s on average.

So indeed, it is quite slow.


The equivalent Python code

from keras.models import load_model
import numpy as np
import datetime

test_runs = 3
model = load_model('out_model.h5')
data_in = np.zeros((1, 18000, 3))
start_time = datetime.datetime.now()
for i in range(test_runs):
    print(f'Starting test run {i}.')
    model.predict(data_in)
end_time = datetime.datetime.now()
duration_avg = (end_time - start_time) / test_runs
print('Forward pass took {} s on average.'.format(duration_avg))

ran with:

CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 main.py

is much faster:

Starting test run 0.
Starting test run 1.
Starting test run 2.
Forward pass took 0:00:01.365790 s on average.

I'll see if I can find out what is going on there, and will get back to you with the results.

Dobiasd reopened this Mar 27, 2019

Dobiasd commented Mar 27, 2019

Mhh, the profiler output does not look that much different from a VGG19 run, shown here as an example:

Your model: [profiler screenshot]

VGG19: [profiler screenshot]

So I don't yet know what is happening.

@Dobiasd
Copy link
Owner

Dobiasd commented Mar 27, 2019

To make sure the caching actually works, I added

std::cout << "Apply layer '" << name_ << "'." << std::endl;

to the beginning of fdeep::layer::apply.

The output looks OK:

Apply layer 'C1D_Stacked'.
Apply layer 'diff'.
Apply layer ''.
Apply layer 'PaddingDiffEdges'.
Apply layer 'batch_normalization_2'.
Apply layer 'conv1d_6'.
Apply layer ''.
Apply layer 'conv1d_7'.
Apply layer ''.
Apply layer 'C1D_L0_D1'.
Apply layer ''.
Apply layer 'C1D_L0_D2'.
Apply layer ''.
Apply layer 'C1D_L0_D4'.
Apply layer ''.
Apply layer 'C1D_L0_D8'.
Apply layer ''.
Apply layer 'C1D_L0_D16'.
Apply layer ''.
Apply layer 'C1D_L0_D32'.
Apply layer ''.
Apply layer 'C1D_L0_D64'.
Apply layer ''.
Apply layer 'concatenate_3'.
Apply layer 'BN_L0'.
Apply layer 'Relu_L0'.
Apply layer 'conv1d_8'.
Apply layer ''.
Apply layer 'C1D_L1_D1'.
Apply layer ''.
Apply layer 'C1D_L1_D2'.
Apply layer ''.
Apply layer 'C1D_L1_D4'.
Apply layer ''.
Apply layer 'C1D_L1_D8'.
Apply layer ''.
Apply layer 'C1D_L1_D16'.
Apply layer ''.
Apply layer 'C1D_L1_D32'.
Apply layer ''.
Apply layer 'C1D_L1_D64'.
Apply layer ''.
Apply layer 'add_2'.
Apply layer 'BN_L1'.
Apply layer 'Relu_L1'.
Apply layer 'C1D_L2_D1'.
Apply layer ''.
Apply layer 'C1D_L2_D2'.
Apply layer ''.
Apply layer 'C1D_L2_D4'.
Apply layer ''.
Apply layer 'C1D_L2_D8'.
Apply layer ''.
Apply layer 'C1D_L2_D16'.
Apply layer ''.
Apply layer 'C1D_L2_D32'.
Apply layer ''.
Apply layer 'C1D_L2_D64'.
Apply layer ''.
Apply layer 'concatenate_4'.
Apply layer 'BN_L2'.
Apply layer 'Relu_L2'.
Apply layer 'SmoothingStep'.
Apply layer 'cropping1d_7'.
Apply layer 'zero_padding1d_3'.
Apply layer 'conv1d_9'.
Apply layer ''.
Apply layer 'drive_right_down'.
Apply layer ''.
Apply layer 'conv1d_10'.
Apply layer ''.
Apply layer 'hiccup_event'.
Apply layer ''.
Apply layer 'drive_sep_features'.
Apply layer ''.
Apply layer 'drive_sep_output'.
Apply layer ''.
Apply layer 'cropping1d_8'.
Apply layer 'zero_padding1d_4'.
Apply layer 'drive_dist'.
Apply layer ''.

So no layer is applied more than once.

🤷‍♂️

kmader commented Mar 27, 2019

Thanks, that definitely seems to be a good approach. Unfortunately, trying to figure out what TensorFlow does with the model will probably be a lot messier.

Caching would have been the easiest problem to fix. I guess it is possibly some optimization being applied on the TF side, or maybe the way it deals with large dilation rates (_D64 means dilation_rate=64), that is causing the problem? (For reference, the model is a sort of stacked WaveNet, which is why those dilations come up: https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

Dobiasd commented Mar 27, 2019

Ah, very good hint. I did not catch this property of the model.

Yes, such a dilation rate will cause frugally-deep to create huge tensors, consisting of mostly 0s. Here is the corresponding part of the code:

inline tensor5 dilate_tensor5(const shape2& dilation_rate, const tensor5& in)

Subsequently, these big tensors are fed into the convolution step, which consequently takes a lot of time.
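
As a simplified 1D sketch of that zero-insertion idea (this is not the actual dilate_tensor5 code, which works on tensor5s; the function below is only illustrative): a kernel of size 3 with dilation_rate=64 turns into 129 taps, 126 of which are zeros that a naive convolution still multiplies and adds:

#include <cstddef>
#include <iostream>
#include <vector>

// Illustrative only: insert (dilation_rate - 1) zeros between the taps
// of a 1D kernel; feeding such a blown-up kernel to a regular convolution
// is what makes big dilation rates expensive.
std::vector<float> dilate_kernel_1d(const std::vector<float>& kernel,
                                    std::size_t dilation_rate)
{
    if (kernel.size() < 2 || dilation_rate <= 1)
        return kernel;
    std::vector<float> result((kernel.size() - 1) * dilation_rate + 1, 0.0f);
    for (std::size_t i = 0; i < kernel.size(); ++i)
        result[i * dilation_rate] = kernel[i];
    return result;
}

int main()
{
    const auto dilated = dilate_kernel_1d({0.1f, 0.2f, 0.3f}, 64);
    std::cout << dilated.size() << std::endl; // prints 129; only 3 taps are non-zero
}

A dilation-aware convolution could skip the zero taps (or, equivalently, stride through the input), which is presumably what TensorFlow does internally.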

Dobiasd changed the title from "Performance on Branching Models" to "Poor performance with big dilation rates (was "Performance on Branching Models" before)" Mar 27, 2019

Dobiasd commented Mar 27, 2019

I guess now we need a hero to step forward, dive into the convolution code of frugally-deep, and find and implement some trick to better cope with dilation. :goberserk:

Anybody of the attendees, perhaps? 👀

kmader commented Mar 27, 2019

Yeah, that makes sense. I can try to figure out what TensorFlow Lite does, since there might be something easy that would work here.

Dobiasd commented Jul 5, 2019

@kmader How is it going? Did you find out anything? :)

kmader commented Jul 5, 2019

@Dobiasd Sorry, I started looking into it but got caught up in other projects. I'll try to dive a bit deeper next week and see if I can find anything.
