Synchronous Inference Request

InferRequest class functionality:

  • Allocate input and output blobs needed for a backend-dependent network inference.

  • Define functions for inference process stages (for example, preprocess, upload, infer, download, postprocess). These functions can later be used to define an execution pipeline during Asynchronous Inference Request implementation.

  • Call inference stages one by one synchronously.

Class

The Inference Engine Plugin API provides the helper InferenceEngine::IInferRequestInternal class, which is recommended as a base class for a synchronous inference request implementation. Based on that, a declaration of a synchronous request class can look as follows:

class TemplateInferRequest : public InferenceEngine::IInferRequestInternal {
public:
    typedef std::shared_ptr<TemplateInferRequest> Ptr;

    TemplateInferRequest(const InferenceEngine::InputsDataMap& networkInputs,
                         const InferenceEngine::OutputsDataMap& networkOutputs,
                         const std::shared_ptr<ExecutableNetwork>& executableNetwork);
    TemplateInferRequest(const std::vector<std::shared_ptr<const ov::Node>>& inputs,
                         const std::vector<std::shared_ptr<const ov::Node>>& outputs,
                         const std::shared_ptr<ExecutableNetwork>& executableNetwork);
    ~TemplateInferRequest();

    void InferImpl() override;
    std::map<std::string, InferenceEngine::InferenceEngineProfileInfo> GetPerformanceCounts() const override;

    // pipeline methods-stages which are used in async infer request implementation and assigned to particular executor
    void inferPreprocess();
    void startPipeline();
    void waitPipeline();
    void inferPostprocess();

    InferenceEngine::Blob::Ptr GetBlob(const std::string& name) override;
    void SetBlob(const std::string& name, const InferenceEngine::Blob::Ptr& userBlob) override;

    void SetBlobsImpl(const std::string& name, const InferenceEngine::BatchedBlob::Ptr& batchedBlob) override;

private:
    void createInferRequest();
    void allocateDeviceBuffers();
    void allocateBlobs();

    enum { Preprocess, Postprocess, StartPipeline, WaitPipeline, numOfStages };

    std::shared_ptr<ExecutableNetwork> _executableNetwork;
    std::array<openvino::itt::handle_t, numOfStages> _profilingTask;
    // for performance counters
    std::array<std::chrono::duration<float, std::micro>, numOfStages> _durations;

    InferenceEngine::BlobMap _networkOutputBlobs;

    std::vector<std::shared_ptr<ngraph::runtime::Tensor>> _inputTensors;
    std::vector<std::shared_ptr<ngraph::runtime::Tensor>> _outputTensors;
    std::shared_ptr<ngraph::runtime::Executable> _executable;
};

Class Fields

The example class has several fields:

  • _executableNetwork - reference to an executable network instance. From this reference, an inference request instance can take a task executor, use the counter for the number of created inference requests, and so on.

  • _profilingTask - array of the std::array<openvino::itt::handle_t, numOfStages> type. Defines names for pipeline stages. Used to profile an inference pipeline execution with the Intel® Instrumentation and Tracing Technology (ITT).

  • _durations - array of durations of each pipeline stage.

  • _networkInputBlobs - input blob map.

  • _networkOutputBlobs - output blob map.

  • _parameters - ngraph::Function parameter operations.

  • _results - ngraph::Function result operations.

  • backend specific fields:

    • _inputTensors - input tensors that wrap the _networkInputBlobs blobs. They are used as inputs to the backend _executable computational graph.

    • _outputTensors - output tensors that wrap the _networkOutputBlobs blobs. They are used as outputs of the backend _executable computational graph.

    • _executable - an executable object / backend computational graph.

Constructor

The constructor initializes helper fields and calls methods which allocate blobs:

TemplateInferRequest::TemplateInferRequest(const InferenceEngine::InputsDataMap& networkInputs,
                                           const InferenceEngine::OutputsDataMap& networkOutputs,
                                           const std::shared_ptr<TemplatePlugin::ExecutableNetwork>& executableNetwork)
    : IInferRequestInternal(networkInputs, networkOutputs),
      _executableNetwork(executableNetwork) {
    createInferRequest();
}

TemplateInferRequest::TemplateInferRequest(const std::vector<std::shared_ptr<const ov::Node>>& inputs,
                                           const std::vector<std::shared_ptr<const ov::Node>>& outputs,
                                           const std::shared_ptr<TemplatePlugin::ExecutableNetwork>& executableNetwork)
    : IInferRequestInternal(inputs, outputs),
      _executableNetwork(executableNetwork) {
    createInferRequest();
}

void TemplateInferRequest::createInferRequest() {
    // TODO: allocate infer request device and host buffers if needed, fill actual list of profiling tasks

    auto requestID = std::to_string(_executableNetwork->_requestId.fetch_add(1));

    std::string name = _executableNetwork->_function->get_friendly_name() + "_Req" + requestID;
    _profilingTask = {
        openvino::itt::handle("Template" + std::to_string(_executableNetwork->_cfg.deviceId) + "_" + name +
                              "_Preprocess"),
        openvino::itt::handle("Template" + std::to_string(_executableNetwork->_cfg.deviceId) + "_" + name +
                              "_Postprocess"),
        openvino::itt::handle("Template" + std::to_string(_executableNetwork->_cfg.deviceId) + "_" + name +
                              "_StartPipeline"),
        openvino::itt::handle("Template" + std::to_string(_executableNetwork->_cfg.deviceId) + "_" + name +
                              "_WaitPipeline"),
    };

    _executable = _executableNetwork->_plugin->_backend->compile(_executableNetwork->_function);

    allocateDeviceBuffers();
    allocateBlobs();
}

Note

Call InferenceEngine::CNNNetwork::getInputsInfo and InferenceEngine::CNNNetwork::getOutputsInfo to specify both layout and precision of blobs, which you can set with InferenceEngine::InferRequest::SetBlob and get with InferenceEngine::InferRequest::GetBlob. A plugin uses these hints to determine its internal layouts and precisions for input and output blobs if needed.
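
For illustration, a minimal application-side sketch of providing these hints and exchanging blobs with the plugin is shown below; the "model.xml" path and the "TEMPLATE" device name are assumptions used only for this example:

#include <inference_engine.hpp>

int main() {
    // A minimal application-side sketch; error handling is omitted.
    InferenceEngine::Core core;
    InferenceEngine::CNNNetwork network = core.ReadNetwork("model.xml");  // placeholder model path

    // Specify the layout and precision hints for the first input.
    auto inputsInfo = network.getInputsInfo();
    auto inputInfo = inputsInfo.begin()->second;
    inputInfo->setPrecision(InferenceEngine::Precision::FP32);
    inputInfo->setLayout(InferenceEngine::Layout::NCHW);

    // The plugin uses these hints to choose internal layouts and precisions for its blobs.
    auto executableNetwork = core.LoadNetwork(network, "TEMPLATE");
    auto request = executableNetwork.CreateInferRequest();
    InferenceEngine::Blob::Ptr inputBlob = request.GetBlob(inputsInfo.begin()->first);
    return 0;
}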

Destructor

Decrements the number of created inference requests:

TemplateInferRequest::~TemplateInferRequest() {
    _executableNetwork->_requestId--;
}

Implementation details: The base IInferRequestInternal class implements the public InferenceEngine::IInferRequestInternal::Infer method as follows (a conceptual sketch of this wrapper is shown after the InferImpl code below):

  • Checks the blobs set by the user

  • Calls the InferImpl method defined in a derived class to run the actual pipeline stages synchronously

void TemplateInferRequest::InferImpl() {
    // TODO: fill with actual list of pipeline stages, which are executed synchronously for sync infer requests
    inferPreprocess();
    startPipeline();
    waitPipeline();  // does nothing in current implementation
    inferPostprocess();
}
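
For reference, the base class's Infer wrapper can be thought of as the following simplified sketch; the actual Plugin API implementation may contain additional checks:

// Simplified sketch of the base class wrapper around InferImpl.
void InferenceEngine::IInferRequestInternal::Infer() {
    checkBlobs();  // validate the blobs set by the user
    InferImpl();   // run the derived class pipeline stages synchronously
}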

1. inferPreprocess

Below is the code of the inferPreprocess method, which demonstrates how the Inference Engine common preprocessing step is handled:

void TemplateInferRequest::inferPreprocess() {
    OV_ITT_SCOPED_TASK(itt::domains::TemplatePlugin, _profilingTask[Preprocess]);
    auto start = Time::now();
    convertBatchedInputBlobs();
    // NOTE: After the IInferRequestInternal::execDataPreprocessing call,
    //       the input can point to a memory region other than the one allocated in the constructor.
    IInferRequestInternal::execDataPreprocessing(_deviceInputs);
    for (auto&& networkInput : _deviceInputs) {
        auto index = _executableNetwork->_inputIndex[networkInput.first];
        const auto& parameter = _executableNetwork->_function->get_parameters()[index];
        auto parameterShape = networkInput.second->getTensorDesc().getDims();
        auto srcShape = networkInput.second->getTensorDesc().getBlockingDesc().getBlockDims();
        const auto& parameterType = parameter->get_element_type();
        auto mem_blob = InferenceEngine::as<InferenceEngine::MemoryBlob>(networkInput.second);
        auto isNonRoiDesc = [](const BlockingDesc& desc) {
            size_t exp_stride = 1;
            for (size_t i = 0; i < desc.getBlockDims().size(); i++) {
                size_t rev_idx = desc.getBlockDims().size() - i - 1;
                OPENVINO_ASSERT(desc.getOrder()[rev_idx] == rev_idx,
                                "Template plugin: unsupported tensors with mixed axes order: ",
                                ngraph::vector_to_string(desc.getOrder()));
                if (desc.getStrides()[rev_idx] != exp_stride || desc.getOffsetPaddingToData()[rev_idx] != 0) {
                    return false;
                }
                exp_stride *= desc.getBlockDims()[rev_idx];
            }
            return true;
        };
        if (isNonRoiDesc(networkInput.second->getTensorDesc().getBlockingDesc())) {
            // No ROI extraction is needed
            _inputTensors[index] = _executableNetwork->_plugin->_backend->create_tensor(parameterType,
                                                                                        parameterShape,
                                                                                        mem_blob->rmap().as<void*>());
        } else {
            OPENVINO_ASSERT(parameterType.bitwidth() % 8 == 0,
                            "Template plugin: Unsupported ROI tensor with element type having ",
                            std::to_string(parameterType.bitwidth()),
                            " bits size");
            // Perform manual extraction of ROI tensor
            // Basic implementation doesn't take axis order into account `desc.getBlockingDesc().getOrder()`
            // Performance of manual extraction is not optimal, but it is ok for template implementation
            _inputTensors[index] = _executableNetwork->_plugin->_backend->create_tensor(parameterType, parameterShape);
            auto desc = mem_blob->getTensorDesc();
            auto* src_data = mem_blob->rmap().as<uint8_t*>();
            auto dst_tensor = std::dynamic_pointer_cast<ngraph::runtime::HostTensor>(_inputTensors[index]);
            OPENVINO_ASSERT(dst_tensor, "Template plugin error: Can't cast created tensor to HostTensor");
            auto* dst_data = dst_tensor->get_data_ptr<uint8_t>();
            std::vector<size_t> indexes(parameterShape.size());
            for (size_t dst_idx = 0; dst_idx < ov::shape_size(parameterShape); dst_idx++) {
                size_t val = dst_idx;
                size_t src_idx = 0;
                for (size_t j1 = 0; j1 < indexes.size(); j1++) {
                    size_t j = indexes.size() - j1 - 1;
                    indexes[j] = val % parameterShape[j] + desc.getBlockingDesc().getOffsetPaddingToData()[j];
                    val /= parameterShape[j];
                    src_idx += indexes[j] * desc.getBlockingDesc().getStrides()[j];
                }
                memcpy(dst_data + dst_idx * parameterType.size(),
                       src_data + src_idx * parameterType.size(),
                       parameterType.size());
            }
        }
    }
    for (auto&& output : _outputs) {
        auto outputBlob = output.second;
        auto networkOutput = _networkOutputBlobs[output.first];
        auto index = _executableNetwork->_outputIndex[output.first];
        if (outputBlob->getTensorDesc().getPrecision() == networkOutput->getTensorDesc().getPrecision()) {
            networkOutput = outputBlob;
        }
        const auto& result = _executableNetwork->_function->get_results()[index];
        if (result->get_output_partial_shape(0).is_dynamic()) {
            _outputTensors[index] = _executableNetwork->_plugin->_backend->create_tensor();
            continue;
        }
        const auto& resultShape = result->get_shape();
        const auto& resultType = result->get_element_type();
        _outputTensors[index] = _executableNetwork->_plugin->_backend->create_tensor(
            resultType,
            resultShape,
            InferenceEngine::as<InferenceEngine::MemoryBlob>(networkOutput)->wmap().as<void*>());
    }
    _durations[Preprocess] = Time::now() - start;
}

Details:

  • InferImpl must call the InferenceEngine::IInferRequestInternal::execDataPreprocessing function, which executes the common Inference Engine preprocessing step (for example, resize or color conversion operations) if it is set by the user. The output dimensions, layout, and precision match the input information set via InferenceEngine::CNNNetwork::getInputsInfo.

  • If the precision of the inputBlob passed by the user differs from the precision expected by the plugin, blobCopy is performed to do the actual precision conversion (a simplified sketch follows).
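
A simplified, hypothetical sketch of such a precision conversion is shown below; the convertFP32ToI32 helper name is an assumption, and the real blobCopy utility handles many more precision pairs:

#include <ie_blob.h>
#include <cstdint>

// Hypothetical helper: element-wise conversion from an FP32 user blob
// to an I32 blob expected by the plugin.
static void convertFP32ToI32(const InferenceEngine::Blob::Ptr& src, const InferenceEngine::Blob::Ptr& dst) {
    auto srcMem = InferenceEngine::as<InferenceEngine::MemoryBlob>(src);
    auto dstMem = InferenceEngine::as<InferenceEngine::MemoryBlob>(dst);
    const float* srcData = srcMem->rmap().as<const float*>();
    int32_t* dstData = dstMem->wmap().as<int32_t*>();
    for (size_t i = 0; i < src->size(); ++i) {
        dstData[i] = static_cast<int32_t>(srcData[i]);
    }
}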

2. startPipeline

Executes the pipeline synchronously using the _executable object:

void TemplateInferRequest::startPipeline() {
    OV_ITT_SCOPED_TASK(itt::domains::TemplatePlugin, _profilingTask[StartPipeline]);
    auto start = Time::now();
    _executable->call(_outputTensors, _inputTensors);
    _durations[StartPipeline] = Time::now() - start;
}

3. inferPostprocess

Converts output blobs if the precisions of the backend output blobs and the blobs passed by the user differ:

void TemplateInferRequest::inferPostprocess() {
    OV_ITT_SCOPED_TASK(itt::domains::TemplatePlugin, _profilingTask[Postprocess]);
    auto start = Time::now();
    for (auto&& output : _networkOutputs) {
        auto index = _executableNetwork->_outputIndex[output.first];
        const auto& result = _executableNetwork->_function->get_results()[index];
        if (result->get_output_partial_shape(0).is_dynamic()) {
            // Touch blob to allocate it
            GetBlob(output.first);
        }
        auto outputBlob = _outputs.at(output.first);
        auto networkOutput = _networkOutputBlobs[output.first];
        if (outputBlob->getTensorDesc().getPrecision() != networkOutput->getTensorDesc().getPrecision()) {
            blobCopy(networkOutput, outputBlob);
        } else if (result->get_output_partial_shape(0).is_dynamic()) {
            auto tensor = _outputTensors[_executableNetwork->_outputIndex.at(output.first)];
            tensor->read(InferenceEngine::as<InferenceEngine::MemoryBlob>(outputBlob)->wmap().as<char*>(),
                         tensor->get_size_in_bytes());
        }
    }
    _durations[Postprocess] = Time::now() - start;
}

The GetPerformanceCounts method returns the performance counters measured during the pipeline stage execution:

std::map<std::string, InferenceEngineProfileInfo> TemplateInferRequest::GetPerformanceCounts() const {
    std::map<std::string, InferenceEngineProfileInfo> perfMap;
    InferenceEngineProfileInfo info;
    info.execution_index = 0;
    info.status = InferenceEngineProfileInfo::EXECUTED;

    info.cpu_uSec = info.realTime_uSec = _durations[Preprocess].count();
    perfMap["1. input preprocessing"] = info;
    info.cpu_uSec = info.realTime_uSec = 0;
    perfMap["2. input transfer to a device"] = info;
    info.cpu_uSec = info.realTime_uSec = _durations[StartPipeline].count();
    perfMap["3. execution time"] = info;
    info.cpu_uSec = info.realTime_uSec = 0;
    perfMap["4. output transfer from a device"] = info;
    info.cpu_uSec = info.realTime_uSec = _durations[Postprocess].count();
    perfMap["5. output postprocessing"] = info;
    return perfMap;
}
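
On the application side, these counters can be retrieved through the public API. A short usage sketch, assuming request is an InferenceEngine::InferRequest created from the executable network and <iostream> is included:

// Print the per-stage counters reported by the plugin.
auto perfCounts = request.GetPerformanceCounts();
for (const auto& entry : perfCounts) {
    std::cout << entry.first << ": " << entry.second.realTime_uSec << " us" << std::endl;
}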

The next step in the plugin library implementation is the Asynchronous Inference Request class.