
How to generate and perform inference for an ONNX model #350

Open

ghost opened this issue Jul 19, 2023 · 2 comments

ghost commented Jul 19, 2023

Thanks for the awesome work!
Currently, I've been struggling with an issue while working with Speedster, which I will lay out below:

1. I've been able to optimize an ONNX model (from HuggingFace, based on Donut: https://github.com/clovaai/donut).

code used:

import numpy as np
import torch
from speedster import optimize_model, save_model

# Provide input data for the model: 100 samples of
# (decoder_input_ids, encoder_hidden_states) plus labels
input_data = [
    (
        (
            np.array(torch.randn(5, 3), dtype=np.int64),
            np.array(torch.randn(5, 3, 1024), dtype=np.float32),
        ),
        torch.tensor([0, 1, 0, 1, 1]),
    )
    for _ in range(100)
]

# Run the Speedster optimization
optimized_model = optimize_model(
    "./models/onnx/decoder_model.onnx",
    input_data=input_data,
    optimization_time="unconstrained",
    device="gpu:0",
    metric_drop_ths=0.8,
)

save_model(optimized_model, "./models/speedster")

output:

2023-07-19 14:22:43 | INFO     | Running Speedster on GPU:0
2023-07-19 14:25:33 | INFO     | Benchmark performance of original model
2023-07-19 14:26:10 | INFO     | Original model latency: 0.023933820724487305 sec/iter
2023-07-19 14:26:11 | INFO     | [1/1] Running ONNX Optimization Pipeline
2023-07-19 14:26:11 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-07-19 14:26:14 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-07-19 14:26:16 | INFO     | Optimized model latency: 0.02505326271057129 sec/iter
2023-07-19 14:26:16 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:26:44 | INFO     | Optimized model latency: 0.3438906669616699 sec/iter
2023-07-19 14:26:44 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-07-19 14:28:18 | INFO     | Optimized model latency: 0.004456996917724609 sec/iter
2023-07-19 14:28:18 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:28:51 | INFO     | Optimized model latency: 0.003861665725708008 sec/iter
2023-07-19 14:28:51 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-07-19 14:33:56 | INFO     | Optimized model latency: 0.004480838775634766 sec/iter

[Speedster results on Tesla V100-SXM2-16GB]
Metric       Original Model    Optimized Model    Improvement
-----------  ----------------  -----------------  -------------
backend      NUMPY             TensorRT
latency      0.0239 sec/batch  0.0039 sec/batch   6.20x
throughput   208.91 data/sec   1294.78 data/sec   6.20x
model size   743.98 MB         254.43 MB          -65%
metric drop                    0.5291
techniques                     fp16
2. I am now hitting a wall when trying to perform inference.

code used:
import torch
from speedster import load_model

optimized_model = load_model("../opt/models/speedster/")
print('speedster onnx model loaded')

device = "cuda" if torch.cuda.is_available() else "cpu"
dummy_input = torch.randn(1, 3, 300, 400, dtype=torch.float).to(device)
print(type(dummy_input))

# Use the accelerated version of your ONNX model in production
output = optimized_model(dummy_input)
print(output)

observation:

2023-07-19 14:35:43 | WARNING  | Debug: Got extra keywords in NvidiaInferenceLearner::from_engine_path: {'class_name': 'NumpyONNXTensorRTInferenceLearner', 'module_name': 'nebullvm.operations.inference_learners.tensor_rt'}
speedster onnx model loaded
<class 'torch.Tensor'>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-ea33d0034b2d> in <cell line: 20>()
     18 
     19 # Use the accelerated version of your ONNX model in production
---> 20 output = optimized_model(dummy_input)
     21 print(output)

5 frames
/usr/local/lib/python3.10/dist-packages/polygraphy/cuda/cuda.py in dtype(self, new)
    296     def dtype(self, new):
    297         self._dtype = new
--> 298         self.itemsize = np.dtype(new).itemsize
    299 
    300     @property

TypeError: Cannot interpret 'torch.float32' as a data type

So my question is: what types of parameters should I be passing to the optimized_model() call here? Previously, I passed the following to the original model to get it working; after the snippet I have also sketched what I currently suspect the optimized model expects.

def run_prediction(test_sample, model=model, processor=processor):
    pixel_values = processor(test_sample, return_tensors="pt").pixel_values
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=False,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )
    prediction = processor.batch_decode(outputs.sequences)[0]
    prediction = processor.token2json(prediction)
    return prediction 
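
Since the optimization was run with numpy inputs, my guess (not verified) is that the saved NumpyONNXTensorRTInferenceLearner expects the same two numpy arrays that were used in input_data, passed positionally, rather than a single torch tensor. A minimal sketch of what I would try under that assumption:

import numpy as np
from speedster import load_model

# Reload the optimized model saved above
optimized_model = load_model("./models/speedster")

# Assumption: the optimized decoder takes the same two inputs used during
# optimization -- int64 decoder input ids of shape (5, 3) and float32 encoder
# hidden states of shape (5, 3, 1024) -- as numpy arrays, not torch tensors.
decoder_input_ids = np.random.randint(0, 100, size=(5, 3), dtype=np.int64)
encoder_hidden_states = np.random.randn(5, 3, 1024).astype(np.float32)

output = optimized_model(decoder_input_ids, encoder_hidden_states)
print(output)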

Please let me know if you require additional information. Thanks.

ghost commented Jul 22, 2023

I've been able to make some progress with optimizing the ONNX model, but I'm getting some errors when reaching the Speedster optimization stage.
Please also find my Colab notebook below: https://github.com/dneemuth/saversbasket/blob/main/Optimizing_Transformers_Speedster.ipynb

ghost commented Jul 22, 2023

@mfumanelli sorry for the interruption, but I was hoping you could point me in the right direction. I've been struggling with an issue while trying to optimize an ONNX model via Speedster; I might be doing something wrong here.
I already have a script to replicate the issue in my Google Colab account if you want to have a look. Thanks.

https://colab.research.google.com/drive/1eHYU0dKcM-ms3oL2pH6YWQ_qrWDSLWYH?usp=sharing
