Using Advanced Throughput Options: Streams and Batching¶
OpenVINO Streams¶
As detailed in the common-optimizations section, running multiple inference requests asynchronously is important for general application efficiency. Internally, every device implements a queue, which acts as a buffer, storing the inference requests until retrieved by the device at its own pace. A device may actually process multiple inference requests in parallel in order to improve the device utilization and overall throughput. This configurable means of device-side parallelism is commonly referred to as streams.
Note
Notice that streams really do execute the requests in parallel, but not in lock step (as, e.g., batching does), which makes the streams fully compatible with dynamically-shaped inputs, where individual requests can have different shapes.
Note
Most OpenVINO devices (including the CPU, GPU and VPU) support streams, yet the optimal number of streams is deduced very differently; please see the dedicated sections below.
A few general considerations:
Using the streams does increase the latency of an individual request
When the number of streams is not specified, a device creates a bare minimum of streams (usually just one), as the latency-oriented case is the default
Please find further tips on the optimal number of streams below
Streams are memory-hungry, as every stream duplicates the intermediate buffers to run inference in parallel to the rest of the streams
Always prefer streams over creating multiple ov::CompiledModel instances for the same model, as the weights memory is shared across streams, reducing the memory consumption
Notice that the streams also inflate the model load (compilation) time.
For efficient asynchronous execution, the streams actually handle the inference with a special pool of threads (a thread per stream). Each time you start an inference request (potentially from different application threads), it is muxed into the inference queue of the particular ov::CompiledModel. If there is a vacant stream, it pops the request from the queue and dispatches it to the on-device execution. There are further device-specific details, e.g. for the CPU, that you may find in the internals section.
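A minimal C++ sketch of this flow, assuming a model file named model.xml and the CPU device (both are placeholders); the stream count of 4 is purely illustrative:

```cpp
#include <openvino/openvino.hpp>
#include <vector>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Ask for 4 streams explicitly (illustrative value); the plugin creates
    // a pool with one thread per stream.
    auto compiled = core.compile_model(model, "CPU", ov::num_streams(4));

    // Create several infer requests and start them asynchronously; they are
    // queued and picked up by vacant streams. (Filling the input tensors is
    // omitted for brevity.)
    std::vector<ov::InferRequest> requests;
    for (int i = 0; i < 4; ++i)
        requests.push_back(compiled.create_infer_request());

    for (auto& request : requests)
        request.start_async();  // non-blocking; muxed into the compiled model's queue
    for (auto& request : requests)
        request.wait();         // block until the corresponding request completes
    return 0;
}
```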
Batching¶
Hardware accelerators such as GPUs are optimized for massive compute parallelism, so batching helps to saturate the device and leads to higher throughput. While the streams (described earlier) already help to hide the communication overheads and certain bubbles in the scheduling, running multiple OpenCL kernels simultaneously is less GPU-efficient than calling a kernel on multiple inputs at once.
As explained in the next section, batching is a must to reach maximum throughput on the GPUs.
There are two primary ways of using batching to improve application performance:
Collecting the inputs explicitly on the application side and then sending these batched requests to OpenVINO
Although this gives flexibility with the possible batching strategies, the approach requires redesigning the application logic
Sending individual requests, while configuring OpenVINO to collect and perform inference on the requests in batches automatically
In both cases, the optimal batch size is very device-specific. Also, as explained below, the optimal batch size depends on the model, inference precision and other factors. Both approaches are sketched below.
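A hedged C++ sketch of both approaches, assuming a GPU target, a model whose first input dimension is the batch, and a placeholder model path; the batch size of 4 is illustrative, and the BATCH meta-device syntax is used as one way to request automatic batching:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;

    // 1) Explicit batching on the application side: reshape the model to a
    //    batch of 4 and fill the batched input tensor yourself.
    {
        auto model = core.read_model("model.xml");  // placeholder path
        ov::set_batch(model, 4);                    // assumes the first dimension is the batch
        auto compiled = core.compile_model(model, "GPU");
        auto request = compiled.create_infer_request();
        // ... copy the 4 collected inputs into request.get_input_tensor() ...
        request.infer();
    }

    // 2) Automatic batching: keep sending individual requests and let OpenVINO
    //    group them. One way is the explicit BATCH meta-device (batch of 4 here);
    //    the THROUGHPUT performance hint may also enable it implicitly on the GPU.
    {
        auto model = core.read_model("model.xml");
        auto compiled = core.compile_model(model, "BATCH:GPU(4)");
        auto request = compiled.create_infer_request();  // individual, non-batched request
        request.infer();
    }
    return 0;
}
```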
Choosing the Number of Streams and/or Batch Size¶
Predicting the inference performance is difficult and finding optimal execution parameters requires direct experiments with measurements. Run performance testing in the scope of development, and make sure to validate overall (end-to-end) application performance.
Different devices behave differently with respect to the batch size. The optimal batch size depends on the model, inference precision and other factors. Similarly, different devices require a different number of execution streams to saturate. Finally, in some cases a combination of streams and batching may be required to maximize the throughput.
One possible throughput optimization strategy is to set an upper bound for latency and then increase the batch size and/or number of the streams until that tail latency is met (or the throughput is not growing anymore). Also, consider OpenVINO Deep Learning Workbench that builds handy latency vs throughput charts, iterating over possible values of the batch size and number of streams.
Note
When playing with dynamically-shaped inputs, use only the streams (no batching), as they tolerate individual requests having different shapes.
Note
Using the High-Level Performance Hints is the alternative, portable and future-proof option, allowing OpenVINO to find the best combination of streams and batching for a given scenario and model.
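For illustration, a minimal sketch of the hint-based configuration (the device name and model path are placeholders):

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Let the device pick the streams and/or batching for a throughput-oriented scenario.
    auto compiled = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
    return 0;
}
```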
Number of Streams Considerations¶
Select a number of streams that is less than or equal to the number of requests that your application would be able to run simultaneously
To avoid wasting resources, the number of streams should be enough to meet the average parallel slack rather than the peak load
As a more portable option (that also respects the underlying hardware configuration), use ov::streams::AUTO
It is very important to keep these streams busy, by running as many inference requests as possible (e.g. start the newly-arrived inputs immediately)
The bare minimum of requests to saturate the device can be queried as ov::optimal_number_of_infer_requests of the ov::CompiledModel
The maximum number of streams for the device (per model) can be queried as ov::range_for_streams; see the query sketch below
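A minimal sketch of the queries above (the device name and model path are placeholders):

```cpp
#include <openvino/openvino.hpp>
#include <iostream>
#include <tuple>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path
    auto compiled = core.compile_model(model, "GPU");

    // Bare minimum of requests needed to saturate the device with this model.
    auto min_requests = compiled.get_property(ov::optimal_number_of_infer_requests);

    // Supported range of streams for the device (min, max).
    auto streams_range = core.get_property("GPU", ov::range_for_streams);

    std::cout << "optimal_number_of_infer_requests: " << min_requests << "\n"
              << "range_for_streams: " << std::get<0>(streams_range)
              << ".." << std::get<1>(streams_range) << std::endl;
    return 0;
}
```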
Batch Size Considerations¶
Select the batch size that is equal to the number of requests that your application is able to run simultaneously
Otherwise (or if the number of “available” requests fluctuates), you may need to keep several instances of the network (reshaped to the different batch sizes) and select the properly-sized instance at runtime
For OpenVINO devices that internally implement a dedicated heuristic, ov::optimal_batch_size is a device property (that accepts the actual model as a parameter) to query the recommended batch size for the model, as shown in the sketch below.
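A hedged sketch of this query, using the GPU as the example device that implements the heuristic (the model path is a placeholder):

```cpp
#include <openvino/openvino.hpp>
#include <iostream>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Ask the device for its recommended batch size for this particular model.
    auto batch = core.get_property("GPU", ov::optimal_batch_size, ov::hint::model(model));
    std::cout << "Recommended batch size: " << batch << std::endl;
    return 0;
}
```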
Few Device Specific Details¶
For the GPU:
When the parallel slack is small (e.g. only 2-4 requests executed simultaneously), using only the streams for the GPU may suffice
Notice that the GPU runs 2 requests per stream, so 4 requests can be served by 2 streams
Alternatively, consider a single stream with 2 requests (each with a small batch size like 2), which would total the same 4 inputs in flight
Typically, for 4 and more requests the batching delivers better throughput
Batch size can be calculated as “number of inference requests executed in parallel” divided by the “number of requests that the streams consume”
E.g. if you process 16 cameras (with 16 requests inferenced simultaneously) with two GPU streams (each able to process two requests), the batch size per request is 16/(2*2)=4
For the CPU, always use the streams first
On high-end CPUs, using a moderate (2-8) batch size in addition to the maximum number of streams may further improve the performance (see the configuration sketch below).
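A hedged configuration sketch for the CPU, assuming a model whose first input dimension is the batch and a placeholder model path; the batch size of 4 is illustrative:

```cpp
#include <openvino/openvino.hpp>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder path

    // Moderate batch size (illustrative value) on top of the streams;
    // assumes the first dimension of the model inputs is the batch.
    ov::set_batch(model, 4);

    // Let the CPU plugin choose the number of streams for this machine.
    auto compiled = core.compile_model(model, "CPU",
                                       ov::num_streams(ov::streams::AUTO));
    return 0;
}
```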