Hardware acceleration is the use of specialized computer hardware to improve the execution speed and efficiency of an AI model. For LiteRT, this primarily means running AI inference on Graphics Processing Units (GPUs) or Neural Processing Units (NPUs), as well as using the vector instructions of general-purpose Central Processing Units (CPUs).
LiteRT supported hardware acceleration through TFLite Delegates, which take over parts of the LiteRT graph by substituting their own operations into it. LiteRT Next improves upon this process by handling hardware acceleration in two steps:
- Compilation: prepare a model to run on specific hardware.
- Dispatch: run selected operations on the relevant hardware.
The compilation phase modifies a LiteRT model with a new interface that offers more flexibility through compiler plugins. Model compilation occurs ahead of time (AOT), before the graph is executed, tailoring the graph to run on the target device.
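The two steps above can be sketched with the LiteRT Next C++ API. This is a minimal sketch, not a complete program: the include paths, the model path, and the surrounding function are illustrative assumptions.

```cpp
// Assumed include paths for the LiteRT Next C++ API.
#include "litert/cc/litert_compiled_model.h"
#include "litert/cc/litert_environment.h"
#include "litert/cc/litert_model.h"

litert::Expected<void> CompileAndRun(litert::Environment& env) {
  // Compilation: prepare the model for specific hardware (CPU here).
  LITERT_ASSIGN_OR_RETURN(auto model,
      litert::Model::CreateFromFile("model.tflite"));  // illustrative path
  LITERT_ASSIGN_OR_RETURN(auto compiled_model,
      litert::CompiledModel::Create(env, model, kLiteRtHwAcceleratorCpu));

  // Dispatch: allocate buffers and run the compiled graph.
  LITERT_ASSIGN_OR_RETURN(auto inputs, compiled_model.CreateInputBuffers());
  LITERT_ASSIGN_OR_RETURN(auto outputs, compiled_model.CreateOutputBuffers());
  return compiled_model.Run(inputs, outputs);
}
```

The same `CompiledModel` object covers both phases: `Create` performs the hardware-specific preparation, and `Run` dispatches the prepared graph.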
Types of accelerators
LiteRT provides three types of accelerators: NPU, GPU and CPU.
- NPU acceleration supports specialized hardware unified behind a single interface. NPU support is available through an Early Access Program.
- GPU acceleration supports WebGL- and OpenCL-enabled devices.
- CPU acceleration supports a variety of processors through the XNNPACK library. This is the default level of acceleration and is always available.
These accelerators can be combined to get the best possible performance when some complex operations are not available on a given piece of hardware. When accelerators compete over an operation, LiteRT applies the following order of precedence: NPU, GPU, CPU.
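Assuming the `kLiteRtHwAccelerator*` constants are bit flags that can be combined, requesting several accelerators at once might look like the following sketch; treat the OR-combination as an assumption rather than a documented guarantee.

```cpp
// Sketch: request NPU with GPU and CPU as fallbacks. Operations the NPU
// cannot run fall back to the GPU, then the CPU, per the precedence
// order above. Assumes the accelerator flags can be OR-ed together.
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
    CompiledModel::Create(env, model,
        kLiteRtHwAcceleratorNpu | kLiteRtHwAcceleratorGpu |
        kLiteRtHwAcceleratorCpu));
```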
GPU acceleration
With LiteRT Next's GPU acceleration, you can create GPU-friendly input and output buffers, achieve zero-copy with your data in GPU memory, and execute tasks asynchronously to maximize parallelism. LiteRT Next uses a new and improved GPU delegate, not offered by LiteRT.
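The buffer-oriented workflow described above can be sketched as follows. The `Write`/`Read` calls and the `input_size`/`output_size` values are illustrative assumptions about the LiteRT Next C++ API rather than a definitive usage pattern.

```cpp
// Sketch: GPU-friendly buffers created by the compiled model, so data
// can stay in GPU memory between writes, inference, and reads.
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
    CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));
LITERT_ASSIGN_OR_RETURN(auto inputs, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto outputs, compiled_model.CreateOutputBuffers());

std::vector<float> data(input_size, 0.0f);  // illustrative input tensor
inputs[0].Write<float>(absl::MakeConstSpan(data));
compiled_model.Run(inputs, outputs);

std::vector<float> result(output_size);  // illustrative output size
outputs[0].Read<float>(absl::MakeSpan(result));
```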
Running models on GPU with LiteRT requires explicit delegate creation, function calls, and graph modifications. With LiteRT Next, you just specify the accelerator when creating the compiled model:
```cpp
// Create a compiled model targeting GPU
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
    CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));
```
For more information on GPU acceleration, see GPU acceleration with LiteRT Next.
NPU acceleration
LiteRT Next provides a unified interface to harness NPUs without forcing you to individually navigate vendor-specific compilers, runtimes, or library dependencies. Using LiteRT Next for NPU acceleration avoids many vendor-specific and device-specific complications, boosts performance for real-time and large-model inference, and minimizes memory copies with zero-copy hardware buffer usage.
Using NPUs with LiteRT involves converting and compiling a model with Play for On-device AI (PODAI), then deploying the model with a Play AI Pack and Feature Module.