
How to generate and perform inference for an ONNX model #350

Open

ghost opened this issue Jul 19, 2023 · 2 comments

ghost commented Jul 19, 2023

Thanks for the awesome work!
Currently, I've been struggling with an issue while working with Speedster, which I will lay out below:

1. I've been able to optimize an ONNX model (from HuggingFace, based on Donut: https://github.com/clovaai/donut).

code used:

import numpy as np
import torch
from speedster import optimize_model, save_model

# Provide input data for the model: 100 samples of
# (decoder_input_ids, encoder_hidden_states) plus labels
input_data = [
    (
        (
            np.array(torch.randn(5, 3), dtype=np.int64),
            np.array(torch.randn(5, 3, 1024), dtype=np.float32),
        ),
        torch.tensor([0, 1, 0, 1, 1]),
    )
    for _ in range(100)
]

# Run the Speedster optimization
optimized_model = optimize_model(
    "./models/onnx/decoder_model.onnx",
    input_data=input_data,
    optimization_time="unconstrained",
    device="gpu:0",
    metric_drop_ths=0.8,
)

save_model(optimized_model, "./models/speedster")

output:

2023-07-19 14:22:43 | INFO     | Running Speedster on GPU:0
2023-07-19 14:25:33 | INFO     | Benchmark performance of original model
2023-07-19 14:26:10 | INFO     | Original model latency: 0.023933820724487305 sec/iter
2023-07-19 14:26:11 | INFO     | [1/1] Running ONNX Optimization Pipeline
2023-07-19 14:26:11 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-07-19 14:26:14 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-07-19 14:26:16 | INFO     | Optimized model latency: 0.02505326271057129 sec/iter
2023-07-19 14:26:16 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:26:44 | INFO     | Optimized model latency: 0.3438906669616699 sec/iter
2023-07-19 14:26:44 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-07-19 14:28:18 | INFO     | Optimized model latency: 0.004456996917724609 sec/iter
2023-07-19 14:28:18 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:28:51 | INFO     | Optimized model latency: 0.003861665725708008 sec/iter
2023-07-19 14:28:51 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-07-19 14:33:56 | INFO     | Optimized model latency: 0.004480838775634766 sec/iter

[Speedster results on Tesla V100-SXM2-16GB]
Metric       Original Model    Optimized Model    Improvement
-----------  ----------------  -----------------  -------------
backend      NUMPY             TensorRT
latency      0.0239 sec/batch  0.0039 sec/batch   6.20x
throughput   208.91 data/sec   1294.78 data/sec   6.20x
model size   743.98 MB         254.43 MB          -65%
metric drop                    0.5291
techniques                     fp16
2. I am now hitting a wall when trying to perform inference.

code used:
import torch
from speedster import load_model

optimized_model = load_model("../opt/models/speedster/")
print('speedster onnx model loaded')

device = "cuda" if torch.cuda.is_available() else "cpu"
dummy_input = torch.randn(1, 3, 300, 400, dtype=torch.float).to(device)
print(type(dummy_input))

# Use the accelerated version of your ONNX model in production
output = optimized_model(dummy_input)
print(output)

observation:

2023-07-19 14:35:43 | WARNING  | Debug: Got extra keywords in NvidiaInferenceLearner::from_engine_path: {'class_name': 'NumpyONNXTensorRTInferenceLearner', 'module_name': 'nebullvm.operations.inference_learners.tensor_rt'}
speedster onnx model loaded
<class 'torch.Tensor'>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-ea33d0034b2d> in <cell line: 20>()
     18 
     19 # Use the accelerated version of your ONNX model in production
---> 20 output = optimized_model(dummy_input)
     21 print(output)

5 frames
/usr/local/lib/python3.10/dist-packages/polygraphy/cuda/cuda.py in dtype(self, new)
    296     def dtype(self, new):
    297         self._dtype = new
--> 298         self.itemsize = np.dtype(new).itemsize
    299 
    300     @property

TypeError: Cannot interpret 'torch.float32' as a data type

So my question is: what types of parameters should I be passing to the optimized_model() call here? Previously, I passed the following to the original model to get it working; after the snippet I have also sketched what I currently suspect the optimized model expects.

def run_prediction(test_sample, model=model, processor=processor):
    pixel_values = processor(test_sample, return_tensors="pt").pixel_values
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=False,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )
    prediction = processor.batch_decode(outputs.sequences)[0]
    prediction = processor.token2json(prediction)
    return prediction 
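
Since the optimization was run with numpy inputs, my guess (not verified) is that the saved NumpyONNXTensorRTInferenceLearner expects the same two numpy arrays that were used in input_data, passed positionally, rather than a single torch tensor. A minimal sketch of what I would try under that assumption:

import numpy as np
from speedster import load_model

# Reload the optimized model saved above
optimized_model = load_model("./models/speedster")

# Assumption: the optimized decoder takes the same two inputs used during
# optimization -- int64 decoder input ids of shape (5, 3) and float32 encoder
# hidden states of shape (5, 3, 1024) -- as numpy arrays, not torch tensors.
decoder_input_ids = np.random.randint(0, 100, size=(5, 3), dtype=np.int64)
encoder_hidden_states = np.random.randn(5, 3, 1024).astype(np.float32)

output = optimized_model(decoder_input_ids, encoder_hidden_states)
print(output)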

Please let me know if you require additional information. Thanks.

ghost commented Jul 22, 2023

I've been able to make some progress with optimizing the ONNX model, but I'm getting some errors when reaching the Speedster optimization stage.
Please also find my Colab notebook below: https://github.com/dneemuth/saversbasket/blob/main/Optimizing_Transformers_Speedster.ipynb

ghost commented Jul 22, 2023

@mfumanelli sorry for the interruption, but I was hoping you could point me in the right direction. I've been struggling with an issue while trying to optimize an ONNX model via Speedster; I might be doing something wrong here.
I already have a script to replicate the issue in my Google Colab account if you want to have a look. Thanks.

https://colab.research.google.com/drive/1eHYU0dKcM-ms3oL2pH6YWQ_qrWDSLWYH?usp=sharing
