Nebullvm

nebullvm is an open-source tool designed to speed up AI inference in just a few lines of code. nebullvm boosts your model to achieve the maximum acceleration that is physically possible on your hardware.

We are building a new AI inference acceleration product that leverages state-of-the-art open-source optimization tools to optimize the whole software-to-hardware stack. If you like the idea, give us a star to support the project ⭐

The core nebullvm workflow consists of 3 steps:

Select: input your model in your preferred DL framework and express your preferences regarding:
- Accuracy loss: do you want to trade off a little accuracy for much higher performance?
- Optimization time: stellar accelerations can be time-consuming. Can you wait, or do you need an instant answer?
Search: nebullvm automatically tests every combination of optimization techniques across the software-to-hardware stack (sparsity, quantization, compilers, etc.) that is compatible with your needs and local hardware.
Serve: finally, nebullvm chooses the best configuration of optimization techniques and returns an accelerated version of your model in the DL framework of your choice (just on steroids 🚀).
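
The three steps can be pictured as a small search loop. The sketch below is purely illustrative and does not use nebullvm's internals: `Candidate`, `search_best`, and all latency and accuracy-loss figures are hypothetical names and made-up numbers. It keeps the configurations whose accuracy loss fits the user's budget (Select), compares them (Search), and returns the fastest one (Serve).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical record for one optimization configuration."""
    name: str
    latency_ms: float      # measured latency on the local hardware
    accuracy_loss: float   # drop in accuracy relative to the original model


def search_best(candidates: List[Candidate], max_accuracy_loss: float) -> Optional[Candidate]:
    """Return the fastest candidate whose accuracy loss fits the budget."""
    allowed = [c for c in candidates if c.accuracy_loss <= max_accuracy_loss]
    if not allowed:
        return None
    return min(allowed, key=lambda c: c.latency_ms)


# Made-up measurements, for illustration only.
candidates = [
    Candidate("original", latency_ms=40.0, accuracy_loss=0.0),
    Candidate("compiled_fp32", latency_ms=22.0, accuracy_loss=0.0),
    Candidate("compiled_int8", latency_ms=9.0, accuracy_loss=0.02),
]

strict = search_best(candidates, max_accuracy_loss=0.0)
relaxed = search_best(candidates, max_accuracy_loss=0.05)
print(strict.name, relaxed.name)
```

With a zero accuracy-loss budget the quantized candidate is excluded. Relaxing the budget lets the search pick it.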

API quick view

Only a single line of code is needed to get your accelerated model:

import torch
import torchvision.models as models
from nebullvm.api.functions import optimize_model

# Load a ResNet as an example
model = models.resnet50()

# Provide input data for the model
input_data = [((torch.randn(1, 3, 256, 256), ), 0)]

# Run nebullvm optimization in one line of code
optimized_model = optimize_model(
    model, input_data=input_data, optimization_time="constrained"
)

# Try the optimized model
x = torch.randn(1, 3, 256, 256)
res = optimized_model(x)

For more details, please visit Installation and Get started.
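
To check the speedup on your own machine, you can time the original and the optimized model on the same input. The helper below is a minimal sketch that uses only the standard library. `measure_latency` is a hypothetical name and not part of nebullvm, and `dummy_model` is a stand-in so that the snippet runs without any framework installed.

```python
import statistics
import time
from typing import Any, Callable


def measure_latency(fn: Callable[[Any], Any], x: Any, warmup: int = 3, runs: int = 20) -> float:
    """Return the median wall-clock time of fn(x) in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        fn(x)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in callable so the sketch runs without any model installed.
def dummy_model(x):
    return sum(v * v for v in x)


latency_ms = measure_latency(dummy_model, list(range(1000)))
print(f"median latency: {latency_ms:.3f} ms")
```

Assuming both models accept the same tensor, calling `measure_latency(model, x)` and `measure_latency(optimized_model, x)` would give a rough comparison. On a GPU, the timings are only meaningful if the device is synchronized before each clock reading.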

How it works

We are not here to reinvent the wheel, but to build an all-in-one open-source product that brings together the available AI acceleration techniques and delivers the fastest inference possible. To that end, nebullvm builds on existing enterprise-grade open-source optimization tools. When such tools and their communities already exist and are distributed under a permissive license (Apache, MIT, etc.), we integrate them and happily contribute back. When a tool we need does not exist yet, we implement it and open-source the code so that the community can benefit from it.

Product design

nebullvm is shaped around 4 building blocks and leverages a modular design to foster scalability and integration of new acceleration components across the stack.

Converter: converts the input model from its original framework to the framework backends supported by nebullvm, namely PyTorch, TensorFlow, and ONNX. This allows the Compressor and Optimizer modules to apply any optimization technique to the model.
Compressor: applies various compression techniques to the model, such as pruning, knowledge distillation, or quantization-aware training.
Optimizer: converts the compressed models to the intermediate representation (IR) of the supported deep learning compilers. The compilers apply both post-training quantization and graph optimizations to produce compiled binary files.
Inference Learner: takes the best performing compiled model and converts it to the same interface as the original input model.
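
The data flow through the four blocks can be sketched as a chain of functions. This is a toy model of the design, not nebullvm's code: every function name, the dict-based `Model` stand-in, the compiler names, and the latency figures are assumptions made for illustration. Keeping the uncompressed models as candidates alongside the compressed ones is also an assumption of this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a "model" is only a dict of metadata here.
Model = Dict[str, str]


def converter(model: Model) -> List[Model]:
    """Produce one copy of the model per supported backend framework."""
    return [{**model, "backend": b} for b in ("pytorch", "tensorflow", "onnx")]


def compressor(models: List[Model]) -> List[Model]:
    """Keep every model and add a compressed variant of it."""
    return models + [{**m, "compression": "pruned"} for m in models]


def optimizer(models: List[Model], compilers: List[str]) -> List[Model]:
    """Pair every (possibly compressed) model with every compiler."""
    return [{**m, "compiler": c} for m in models for c in compilers]


def inference_learner(models: List[Model], latency: Callable[[Model], float]) -> Model:
    """Pick the fastest candidate and expose it through the original framework."""
    best = min(models, key=latency)
    return {**best, "interface": best["framework"]}


def made_up_latency(m: Model) -> float:
    """Invented latency figures in milliseconds, for illustration only."""
    base = 5.0 if (m["backend"], m["compiler"]) == ("onnx", "compiler_b") else 20.0
    return base * (0.8 if "compression" in m else 1.0)


original = {"name": "resnet50", "framework": "pytorch"}
candidates = optimizer(compressor(converter(original)), ["compiler_a", "compiler_b"])
result = inference_learner(candidates, made_up_latency)
print(len(candidates), result)
```

Because each stage only consumes and produces a list of candidates, a new compression technique or compiler can be added without touching the other stages, which is the point of the modular design.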

The compressor stage leverages the following open-source projects:

Intel/neural-compressor: aims to provide unified APIs for network compression techniques, such as low-precision quantization, sparsity, pruning, and knowledge distillation, across different deep learning frameworks to pursue optimal inference performance.
SparseML: libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models.
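
As a concrete picture of what sparsification means, here is a minimal sketch of unstructured magnitude pruning, one common technique in this family. It works on a plain Python list and does not use the SparseML or Neural Compressor APIs. `magnitude_prune` is a hypothetical helper written for this example.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

The zeros only translate into faster inference when the runtime can skip them, which is why pruning is paired with a sparsity-aware engine in the optimizer stage.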

The optimizer stage leverages the following open-source projects:

Apache TVM: open deep learning compiler stack for CPUs, GPUs, and specialized accelerators.
BladeDISC: end-to-end Dynamic Shape Compiler project for machine learning workloads.
DeepSparse: neural network inference engine that delivers GPU-class performance for sparsified models on CPUs.
OpenVINO: open-source toolkit for optimizing and deploying AI inference.
ONNX Runtime: cross-platform, high-performance ML inferencing and training accelerator.
TensorRT: C++ library for high-performance inference on NVIDIA GPUs and deep learning accelerators.
TFLite and XLA: open-source libraries to accelerate TensorFlow models.
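
Several of these compilers can apply post-training quantization. The sketch below shows the basic idea with symmetric per-tensor int8 quantization on a plain list of floats. It is an illustration of the general technique and not the scheme used by any specific compiler above. `quantize_int8` and `dequantize` are hypothetical helpers.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


values = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(values)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is the accuracy cost that the "Accuracy loss" preference lets you accept or refuse.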

Documentation

Installation
Get started
Notebooks
Benchmarks
Supported features and roadmap

Community

Discord: best for sharing your projects, hanging out with the community and learning about AI acceleration.
GitHub issues: ideal for suggesting new acceleration components, requesting new features, and reporting bugs and improvements.

We’re developing nebullvm together with our community, so the best way to get started is to pick a good first issue. Please read our contribution guidelines for a deep dive on how to best contribute to our project!

Don't forget to leave a star ⭐ to support the project and happy acceleration 🚀

Status

Model converter backends
- ONNX, PyTorch, TensorFlow
- JAX
Compressor
- Pruning and sparsity
- Quantization-aware training, distillation, layer replacement and low-rank compression
Optimizer
- TensorRT, OpenVINO, ONNX Runtime, TVM, PyTorch, DeepSparse, BladeDISC
- TFLite, XLA
Inference learners
- PyTorch, ONNX, Hugging Face, TensorFlow
- JAX

