Poor performance with big dilation rates (was "Performance on Branching Models" before) #132

Open
kmader opened this issue Mar 27, 2019 · 16 comments

kmader commented Mar 27, 2019

Performance on a fairly simple 1D model with stacked branches / bottlenecks / inception blocks was surprisingly poor (>300x slower).

Keras on CPU (TensorFlow Lite for the same model takes 170 µs)

Using the 1 CPU/Thread trick

import numpy as np
import tensorflow as tf
import keras
from keras.models import model_from_json

# restrict TensorFlow to a single thread for a fair CPU comparison
session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)
keras.backend.set_session(sess)
model = model_from_json(json_blob)  # json_blob: the model's JSON string

%%time
model.predict(np.zeros((1, 600, 3)));
CPU times: user 88.1 ms, sys: 35.1 ms, total: 123 ms
Wall time: 25.1 ms

Frugally Deep App

I realize the timing here includes loading and parsing the model, but that is a small fraction of the compute time (as can be seen by changing the length of the sequence).

!time ../build/main
Loading model
Loading json ... done. elapsed time: 0.192233 s
Running test 1 of 1 ... done. elapsed time: 1.472293 s
Loading, constructing, testing of ../model.json took 3.309670 s overall.
Running all zeros: 

real	0m31.150s
user	0m30.712s
sys	0m0.245s

Model Details

Model as Keras H5, FrugallyDeep JSON, and C++ file for inference: https://gist.github.com/kmader/135db41c5ea35c0dc8cae95ed90087f4

[out_model: model architecture image]

kmader commented Mar 27, 2019

My suspicion would be that the model is being evaluated from finish to start without caching any of the intermediate results, recomputing every value every time (which would be roughly 3*8*8*8 times as much work), but I haven't had time to dive into the code and figure out whether that is really the case.

Dobiasd commented Mar 27, 2019

My suspicion would be that the model is being evaluated from finish to start without caching any of the intermediate results, recomputing every value every time (which would be roughly 3*8*8*8 times as much work)

Good thinking, but that is not the case. Frugally-deep does cache the output of each node in the computational graph. :)
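
For illustration only (this is not frugally-deep's actual layer/node code; the struct and names below are made up), the idea is plain memoization: the output of each node is stored the first time it is computed and reused by every later consumer, so shared branches are evaluated only once per forward pass:

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical graph node, just to illustrate output caching.
struct node
{
    std::string name;
    std::vector<const node*> inputs;

    double evaluate(std::map<std::string, double>& cache) const
    {
        const auto it = cache.find(name);
        if (it != cache.end())
            return it->second; // output already computed, reuse it
        double result = 1.0;   // stand-in for the real layer computation
        for (const auto* input : inputs)
            result += input->evaluate(cache);
        cache[name] = result;  // store for later consumers of this node
        return result;
    }
};

int main()
{
    const node a{"a", {}};
    const node b{"b", {&a}};
    const node c{"c", {&a}};
    const node out{"out", {&b, &c}};
    std::map<std::string, double> cache;
    // "a" is evaluated only once, even though both "b" and "c" depend on it.
    std::cout << out.evaluate(cache) << std::endl;
}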

I just ran your code on my machine, and the output is as follows:

Loading model
Loading json ... done. elapsed time: 0.100226 s
Running test 1 of 1 ... done. elapsed time: 0.034413 s
Loading, constructing, testing of ./model.json took 0.250775 s overall.
Running all zeros: 

real    0m0.863s
user    0m0.742s
sys     0m0.116s

The first forward pass (the test one) took 0.034413 s.

So I would guess the reason for the bad performance you see is much simpler, e.g.:

  • You did not tell your compiler to produce optimized machine code.
  • You are using hardware from the 90s.
  • You are mining too much bitcoin in the background.

OK, jokes aside, it's very likely the first of the three above. Please check this part of the FAQ for more details: https://github.com/Dobiasd/frugally-deep/blob/master/FAQ.md#why-is-my-prediction-roughly-100-times-slower-in-c-as-in-python
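
For reference (assuming g++ or clang++, and that the include paths for frugally-deep and its dependencies are already set up), an optimized build looks something like

g++ -std=c++14 -O3 main.cpp -o main

whereas a debug build with just -g leaves all the template-heavy code unoptimized.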

kmader commented Mar 27, 2019

Wow, thanks, I have never seen that crazy a speed-up between -g and -O3. Issue resolved!

Loading model
Loading json ... done. elapsed time: 0.064939 s
Running test 1 of 1 ... done. elapsed time: 0.037276 s
Loading, constructing, testing of ../model.json took 0.270976 s overall.
Running all zeros: 

real	0m0.990s
user	0m0.797s
sys	0m0.186s

kmader closed this as completed Mar 27, 2019

kmader commented Mar 27, 2019

Running it on a tensor of shape (1, 1, 18000, 3), it seems to scale very differently (I guess the Keras/TF runtime is mostly overhead, while in frugally-deep it is mostly compute), but the gap is now a factor of 60:

Loading model
Loading json ... done. elapsed time: 0.059423 s
Running test 1 of 1 ... done. elapsed time: 0.037828 s
Loading, constructing, testing of ../model.json took 0.263522 s overall.
Running all zeros: 

real	0m29.779s
user	0m20.428s
sys	0m7.359s

Keras Evaluation

%%time
_ = model.predict(np.random.uniform(size=(1, 3600, 3)))
CPU times: user 493 ms, sys: 211 ms, total: 705 ms
Wall time: 143 ms

Dobiasd commented Mar 27, 2019

Sorry, I'm not sure I understand correctly.

So when using larger input vectors, you again have a performance problem with frugally-deep, or not?

Dobiasd commented Mar 27, 2019

Wow, thanks, I have never seen that crazy a speed-up between -g and -O3. Issue resolved!

Yeah, frugally-deep, and especially FunctionalPlus, which it uses, rely heavily on compiler optimizations. I tried to write the code to be as readable as possible for humans. For this I avoided manual optimizations, but always checked (with profiling) whether the compilers are able to do them for us. :)

kmader commented Mar 27, 2019

Well, to summarize the (very poorly conducted) benchmarks above:

Vector size    Keras (s)    Frugally-deep (s)
600            0.088        0.990
18000          0.493        29.779

As the data gets larger by a factor of 30, Keras takes about 5x longer while frugally-deep takes about 30x longer. So there must be something else at play in TensorFlow that lets it handle larger inputs better.

Dobiasd commented Mar 27, 2019

Huh, that is interesting.

I tested your model with the following code:

#include "fdeep/fdeep.hpp"
int main()
{
    const std::size_t test_runs = 3;
    const auto model = fdeep::load_model("model.json", true);
    fdeep::tensor5s inputs = {fdeep::tensor5(fdeep::shape5(1, 1, 1, 18000, 3), 0)};
    fplus::stopwatch stopwatch;
    for (std::size_t i = 0; i < test_runs; ++i)
    {
        std::cout << "Starting test run " << i << "." << std::endl;
        model.predict(inputs);
    }
    const double duration_avg = stopwatch.elapsed() / static_cast<double>(test_runs);
    std::cout << "Forward pass took " << duration_avg << " s on average." << std::endl;
}

and had this output:

Loading json ... done. elapsed time: 0.098866 s
Running test 1 of 1 ... done. elapsed time: 0.034203 s
Loading, constructing, testing of model.json took 0.254174 s overall.
Starting test run 0
Starting test run 1
Starting test run 2
Forward pass took 19.488 s on average.

So indeed, it is quite slow.


The equivalent Python code

from keras.models import load_model
import numpy as np
import datetime

test_runs = 3
model = load_model('out_model.h5')
data_in = np.zeros((1, 18000, 3))
start_time = datetime.datetime.now()
for i in range(test_runs):
    print(f'Starting test run {i}.')
    model.predict(data_in)
end_time = datetime.datetime.now()
duration_avg = (end_time - start_time) / test_runs
print('Forward pass took {} s on average.'.format(duration_avg))

ran with:

CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 main.py

is much faster:

Starting test run 0.
Starting test run 1.
Starting test run 2.
Forward pass took 0:00:01.365790 s on average.

I'll see if I can find out what is going on there, and will get back to you with the results.

Dobiasd reopened this Mar 27, 2019

Dobiasd commented Mar 27, 2019

Mhh, the profiler output does not look that much different from a VGG19 run, shown here as an example:

Your model: [profiler screenshot]

VGG19: [profiler screenshot]

So I don't yet know what is happening.

@Dobiasd
Copy link
Owner

Dobiasd commented Mar 27, 2019

To make sure the caching actually works, I added

std::cout << "Apply layer '" << name_ << "'." << std::endl;

to the beginning of fdeep::layer::apply.

The output looks OK:

Apply layer 'C1D_Stacked'.
Apply layer 'diff'.
Apply layer ''.
Apply layer 'PaddingDiffEdges'.
Apply layer 'batch_normalization_2'.
Apply layer 'conv1d_6'.
Apply layer ''.
Apply layer 'conv1d_7'.
Apply layer ''.
Apply layer 'C1D_L0_D1'.
Apply layer ''.
Apply layer 'C1D_L0_D2'.
Apply layer ''.
Apply layer 'C1D_L0_D4'.
Apply layer ''.
Apply layer 'C1D_L0_D8'.
Apply layer ''.
Apply layer 'C1D_L0_D16'.
Apply layer ''.
Apply layer 'C1D_L0_D32'.
Apply layer ''.
Apply layer 'C1D_L0_D64'.
Apply layer ''.
Apply layer 'concatenate_3'.
Apply layer 'BN_L0'.
Apply layer 'Relu_L0'.
Apply layer 'conv1d_8'.
Apply layer ''.
Apply layer 'C1D_L1_D1'.
Apply layer ''.
Apply layer 'C1D_L1_D2'.
Apply layer ''.
Apply layer 'C1D_L1_D4'.
Apply layer ''.
Apply layer 'C1D_L1_D8'.
Apply layer ''.
Apply layer 'C1D_L1_D16'.
Apply layer ''.
Apply layer 'C1D_L1_D32'.
Apply layer ''.
Apply layer 'C1D_L1_D64'.
Apply layer ''.
Apply layer 'add_2'.
Apply layer 'BN_L1'.
Apply layer 'Relu_L1'.
Apply layer 'C1D_L2_D1'.
Apply layer ''.
Apply layer 'C1D_L2_D2'.
Apply layer ''.
Apply layer 'C1D_L2_D4'.
Apply layer ''.
Apply layer 'C1D_L2_D8'.
Apply layer ''.
Apply layer 'C1D_L2_D16'.
Apply layer ''.
Apply layer 'C1D_L2_D32'.
Apply layer ''.
Apply layer 'C1D_L2_D64'.
Apply layer ''.
Apply layer 'concatenate_4'.
Apply layer 'BN_L2'.
Apply layer 'Relu_L2'.
Apply layer 'SmoothingStep'.
Apply layer 'cropping1d_7'.
Apply layer 'zero_padding1d_3'.
Apply layer 'conv1d_9'.
Apply layer ''.
Apply layer 'drive_right_down'.
Apply layer ''.
Apply layer 'conv1d_10'.
Apply layer ''.
Apply layer 'hiccup_event'.
Apply layer ''.
Apply layer 'drive_sep_features'.
Apply layer ''.
Apply layer 'drive_sep_output'.
Apply layer ''.
Apply layer 'cropping1d_8'.
Apply layer 'zero_padding1d_4'.
Apply layer 'drive_dist'.
Apply layer ''.

So no layer is applied more than once.

🤷‍♂️

kmader commented Mar 27, 2019

Thanks, that definitely seems to be a good approach. Unfortunately, trying to figure out what TensorFlow does with the model will probably be a lot messier.

Caching would have been the easiest problem to fix. I guess it is possibly some optimization being applied on the TF side, or maybe the way it deals with large dilation rates (_D64 means dilation_rate=64), that is causing the problem? (For reference, the model is a sort of stacked WaveNet, which is why those dilations come up: https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

Dobiasd commented Mar 27, 2019

Ah, very good hint. I did not catch this property of the model.

Yes, such a dilation rate will cause frugally-deep to create huge tensors, consisting of mostly 0s. Here is the corresponding part of the code:

inline tensor5 dilate_tensor5(const shape2& dilation_rate, const tensor5& in)

Subsequently, these big tensors are fed into the convolution step, which consequently takes a lot of time.
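
As a simplified 1D sketch of that zero-insertion idea (this is not the actual dilate_tensor5 code, which works on tensor5s; the function below is only illustrative): a kernel of size 3 with dilation_rate=64 turns into 129 taps, 126 of which are zeros that a naive convolution still multiplies and adds:

#include <cstddef>
#include <iostream>
#include <vector>

// Illustrative only: insert (dilation_rate - 1) zeros between the taps
// of a 1D kernel; feeding such a blown-up kernel to a regular convolution
// is what makes big dilation rates expensive.
std::vector<float> dilate_kernel_1d(const std::vector<float>& kernel,
                                    std::size_t dilation_rate)
{
    if (kernel.size() < 2 || dilation_rate <= 1)
        return kernel;
    std::vector<float> result((kernel.size() - 1) * dilation_rate + 1, 0.0f);
    for (std::size_t i = 0; i < kernel.size(); ++i)
        result[i * dilation_rate] = kernel[i];
    return result;
}

int main()
{
    const auto dilated = dilate_kernel_1d({0.1f, 0.2f, 0.3f}, 64);
    std::cout << dilated.size() << std::endl; // prints 129; only 3 taps are non-zero
}

A dilation-aware convolution could skip the zero taps (or, equivalently, stride through the input), which is presumably what TensorFlow does internally.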

Dobiasd changed the title from "Performance on Branching Models" to "Poor performance with big dilation rates (was "Performance on Branching Models" before)" Mar 27, 2019

Dobiasd commented Mar 27, 2019

I guess now we need a hero to step forward, dive into the convolution code of frugally-deep, and find and implement some trick to better cope with dilation. :goberserk:

Anybody of the attendees, perhaps? 👀

kmader commented Mar 27, 2019

Yeah, that makes sense. I can try to figure out what TensorFlow Lite does, since there might be something easy that would work here.

Dobiasd commented Jul 5, 2019

@kmader How is it going? Did you find out anything? :)

kmader commented Jul 5, 2019

@Dobiasd Sorry, I started looking into it but got caught up in other projects. I'll try to dive a bit deeper next week and see if I can find anything.
