Poor performance with big dilation rates (was "Performance on Branching Models" before) #132
My suspicion would be that the model is being evaluated from finish to start without caching any of the intermediate results, recomputing every value every time (which would be …)
Good thinking, but that is not the case. Frugally-deep does cache the output of each node in the computational graph. :) I just ran your code on my machine, and the output is as follows:
The first forward pass (the test one) took 0.034413 s. So I would guess the reason for the bad performance you see is much simpler, e.g.:
OK, jokes aside, it's very likely the first of the three above. Please check this part of the FAQ for more details: https://github.com/Dobiasd/frugally-deep/blob/master/FAQ.md#why-is-my-prediction-roughly-100-times-slower-in-c-as-in-python
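For readers following along: the node-output caching mentioned above can be pictured with a small, self-contained sketch. This is not frugally-deep's actual code; node_id, node, and evaluate are hypothetical names used only to illustrate how memoizing node outputs keeps shared branches (as in inception blocks) from being recomputed.

```cpp
// Illustration only: memoized evaluation of a tiny computational graph.
// (Hypothetical types, not frugally-deep's implementation.)
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <vector>

using node_id = std::size_t;
using tensor = std::vector<float>; // stand-in for a real tensor type

struct node
{
    std::vector<node_id> inputs;                          // nodes feeding into this one
    std::function<tensor(const std::vector<tensor>&)> op; // the node's computation
};

// Evaluate a node, reusing already-computed outputs from the cache.
tensor evaluate(const std::map<node_id, node>& graph,
                node_id id,
                std::map<node_id, tensor>& cache)
{
    const auto hit = cache.find(id);
    if (hit != cache.end())
        return hit->second; // output already known, no recomputation

    std::vector<tensor> input_values;
    for (const node_id input : graph.at(id).inputs)
        input_values.push_back(evaluate(graph, input, cache));

    const tensor result = graph.at(id).op(input_values);
    cache[id] = result; // remember the output for later consumers
    return result;
}

int main()
{
    // Diamond-shaped graph: node 0 feeds both 1 and 2, which feed 3.
    std::map<node_id, node> graph;
    graph[0] = {{}, [](const std::vector<tensor>&) { return tensor{1.0f}; }};
    graph[1] = {{0}, [](const std::vector<tensor>& in) { return tensor{in[0][0] + 1.0f}; }};
    graph[2] = {{0}, [](const std::vector<tensor>& in) { return tensor{in[0][0] * 2.0f}; }};
    graph[3] = {{1, 2}, [](const std::vector<tensor>& in) { return tensor{in[0][0] + in[1][0]}; }};

    std::map<node_id, tensor> cache;
    std::cout << evaluate(graph, 3, cache)[0] << std::endl; // node 0 is evaluated only once
}
```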
Wow, thanks, I have never seen that crazy of a speed-up between …
Running it on a vector sized …
Keras Evaluation
Sorry, I'm not sure I understand correctly. So when using larger input vectors, you again have a performance problem with frugally-deep, or not?
Yeah, frugally-deep, and especially FunctionalPlus, which it uses, rely heavily on compiler optimizations. I tried to write the code to be as readable as possible for humans. To this end I avoided manual optimizations, but always checked (with profiling) whether the compilers are able to do them for us. :)
Well, based on the summary of the very poorly conducted benchmarks above:
As the data gets larger by a factor of 30, Keras takes 5x longer and frugally-deep takes 30x longer. So there are some other things coming into play in TensorFlow that let it handle larger inputs better.
Huh, that is interesting. I tested your model with the following code:

```cpp
#include "fdeep/fdeep.hpp"

int main()
{
    const std::size_t test_runs = 3;
    const auto model = fdeep::load_model("model.json", true);
    fdeep::tensor5s inputs = {fdeep::tensor5(fdeep::shape5(1, 1, 1, 18000, 3), 0)};
    fplus::stopwatch stopwatch;
    for (std::size_t i = 0; i < test_runs; ++i)
    {
        std::cout << "Starting test run " << i << "." << std::endl;
        model.predict(inputs);
    }
    const double duration_avg = stopwatch.elapsed() / static_cast<double>(test_runs);
    std::cout << "Forward pass took " << duration_avg << " s on average." << std::endl;
}
```

and had this output:
So indeed, it is quite slow. The equivalent Python code

```python
from keras.models import load_model
import numpy as np
import datetime

test_runs = 3
model = load_model('out_model.h5')
data_in = np.zeros((1, 18000, 3))
start_time = datetime.datetime.now()
for i in range(test_runs):
    print(f'Starting test run {i}.')
    model.predict(data_in)
end_time = datetime.datetime.now()
duration_avg = (end_time - start_time) / test_runs
print('Forward pass took {} s on average.'.format(duration_avg))
```

ran with

```
CUDA_VISIBLE_DEVICES='' taskset --cpu-list 1 python3 main.py
```

is much faster:
I'll see if I can find out what is going on there, and will get back to you with the results.
To make sure the caching actually is working, I added std::cout << "Apply layer '" << name_ << "'." << std::endl; to the beginning of … The output looks OK:
So no layer is applied more than once. 🤷‍♂️
Thanks, that definitely seems to be a good approach. Unfortunately, trying to figure out what TensorFlow does with the model will probably be a lot messier. The caching would have been the easiest problem to fix; I guess it is possibly some optimization being applied on the TF side, or maybe how it deals with large dilation steps (…)
Ah, very good hint. I did not catch this property of the model. Yes, such a dilation rate will cause frugally-deep to create huge tensors, consisting mostly of zeros (see frugally-deep/include/fdeep/tensor5.hpp, line 597 in bd9ca97).
Subsequently, these big tensors are put into the convolution step, which consequently takes a lot of time.
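To make the size blow-up concrete, here is a rough sketch of dilation by zero insertion (illustration only, not frugally-deep's actual code; dilate_kernel is a hypothetical helper, and the sketch dilates the kernel, but the same growth applies wherever the zeros are inserted):

```cpp
// Illustration only: dilating a 1D kernel by inserting zeros.
#include <cstddef>
#include <iostream>
#include <vector>

// A kernel of size k with dilation rate d grows to an effective size
// of d * (k - 1) + 1, filled mostly with zeros.
std::vector<float> dilate_kernel(const std::vector<float>& kernel, std::size_t dilation_rate)
{
    const std::size_t effective_size = dilation_rate * (kernel.size() - 1) + 1;
    std::vector<float> dilated(effective_size, 0.0f);
    for (std::size_t i = 0; i < kernel.size(); ++i)
        dilated[i * dilation_rate] = kernel[i]; // original taps, spread apart
    return dilated;
}

int main()
{
    const std::vector<float> kernel = {0.25f, 0.5f, 0.25f}; // k = 3
    const std::size_t dilation_rate = 64;                   // a "big" dilation rate
    const auto dilated = dilate_kernel(kernel, dilation_rate);
    // Only 3 of the 129 values are non-zero, so a naive convolution with
    // the dilated kernel does roughly 43x more multiply-adds than needed.
    std::cout << "effective kernel size: " << dilated.size() << std::endl;
}
```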
I guess now we need a hero to step forward and dive into the convolution code of frugally-deep, to find and implement some trick to better cope with dilation. Any of the attendees, perhaps? 👀
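Not a patch against frugally-deep's convolution code, just a sketch of the kind of trick meant here, under the assumption of a plain 1D, single-channel, "valid" convolution: index the input with a stride of the dilation rate instead of materializing a zero-filled dilated kernel, so the work per output value stays proportional to the original kernel size.

```cpp
// Illustration only: 1D dilated convolution without materializing zeros.
#include <cstddef>
#include <iostream>
#include <vector>

std::vector<float> conv1d_dilated(const std::vector<float>& input,
                                  const std::vector<float>& kernel,
                                  std::size_t dilation_rate)
{
    const std::size_t effective_size = dilation_rate * (kernel.size() - 1) + 1;
    if (input.size() < effective_size)
        return {};
    std::vector<float> output(input.size() - effective_size + 1, 0.0f);
    for (std::size_t out = 0; out < output.size(); ++out)
        for (std::size_t tap = 0; tap < kernel.size(); ++tap)
            // kernel.size() multiply-adds per output value,
            // independent of the dilation rate.
            output[out] += kernel[tap] * input[out + tap * dilation_rate];
    return output;
}

int main()
{
    const std::vector<float> input(18000, 1.0f); // same length as the test input above
    const std::vector<float> kernel = {0.25f, 0.5f, 0.25f};
    const auto output = conv1d_dilated(input, kernel, 64);
    std::cout << output.size() << " output values, first: " << output[0] << std::endl;
}
```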
Yeah, that makes sense. I can try to figure out what TensorFlow Lite does, as maybe there is something easy that works for this.
@kmader How is it going? Did you find out anything? :)
@Dobiasd sorry, I started looking into it but got caught up in other projects. I'll try to dive a bit deeper next week and see if I can find anything.
Performance on a fairly simple 1D model with stacked branches / bottlenecks / inception blocks was surprisingly poor (>300x slower).
Keras on CPU (TensorFlow Lite for the same model is 170 µs)
Using the 1 CPU/Thread trick
Frugally Deep App
I realize the loop here includes loading and parsing the model, but that is a small fraction of the compute time (as can be seen by changing the length of the sequence)
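For what it's worth, a minimal sketch of timing loading and prediction separately (assuming the fdeep::tensor5 API used in the snippets above, a converted model file named model.json, and the input shape from this thread):

```cpp
// Sketch: measure model loading and the forward pass separately,
// so the timed part only contains predict().
#include "fdeep/fdeep.hpp"
#include <chrono>
#include <iostream>

int main()
{
    const auto load_start = std::chrono::steady_clock::now();
    const auto model = fdeep::load_model("model.json");
    const auto load_end = std::chrono::steady_clock::now();

    const fdeep::tensor5s inputs = {fdeep::tensor5(fdeep::shape5(1, 1, 1, 18000, 3), 0)};

    const auto predict_start = std::chrono::steady_clock::now();
    const auto result = model.predict(inputs);
    const auto predict_end = std::chrono::steady_clock::now();

    const auto to_ms = [](auto from, auto to) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(to - from).count();
    };
    std::cout << "output tensors: " << result.size() << "\n"
              << "load: " << to_ms(load_start, load_end) << " ms, "
              << "predict: " << to_ms(predict_start, predict_end) << " ms" << std::endl;
}
```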
Model Details
Model as Keras H5, FrugallyDeep JSON, and C++ file for inference: https://gist.github.com/kmader/135db41c5ea35c0dc8cae95ed90087f4