smartlab-sequence-modelling-0001

Use Case and High-Level Description

This is an online action segmentation network for 16 classes, trained on an Intel dataset. It is an online version of MSTCN++. The difference is that online MSTCN++ accepts streaming video as input, while the original MSTCN++ assumes that the whole video is available.

For details of the original MSTCN++ model, see the paper.

Specification

| Metric | Value |
|---|---|
| GOPs | 0.048915 |
| MParams | 1.018179 |
| Source framework | PyTorch* |

Accuracy

| Class | Frame-level precision | Frame-level recall | Segment IOU precision | Segment IOU recall |
|---|---|---|---|---|
| noise/background | 0.22 | 0.4 | 0.38 | 0.64 |
| remove_support_sleeve | 0.84 | 0.95 | 0.94 | 1 |
| adjust_rider | 0.81 | 0.83 | 0.77 | 0.96 |
| adjust_nut | 0.62 | 0.86 | 0.65 | 0.94 |
| adjust_balancing | 0.67 | 0.43 | 0.6 | 0.62 |
| open_box | 0.87 | 0.8 | 0.85 | 0.96 |
| close_box | 0.56 | 0.31 | 0.56 | 0.48 |
| choose_weight | 0.52 | 0.52 | 0.68 | 0.77 |
| put_left | 0.54 | 0.68 | 0.74 | 0.91 |
| put_right | 0.74 | 0.65 | 0.88 | 0.88 |
| take_left | 0.62 | 0.62 | 0.72 | 0.83 |
| take_right | 0.68 | 0.51 | 0.78 | 0.85 |
| install support_sleeve | 0.86 | 0.92 | 0.69 | 1 |
| mean | 0.66 | 0.65 | 0.7 | 0.83 |

mPR (P+R)/2: 0.66 for frame-level, 0.77 for segment IOU.
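
As a quick check of the mPR figures, assuming that P and R denote the mean precision and mean recall of the same metric (an interpretation this page does not state explicitly), the rounded means reported above give:

```latex
\begin{align*}
\text{mPR}_{\text{frame-level}} &= \frac{P + R}{2} = \frac{0.66 + 0.65}{2} = 0.655 \approx 0.66 \\
\text{mPR}_{\text{segment IOU}} &= \frac{P + R}{2} = \frac{0.70 + 0.83}{2} = 0.765 \approx 0.77
\end{align*}
```

Both results agree with the reported mPR values after rounding to two decimals.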

Notice: In the accuracy report, the feature extraction network is i3d-rgb. You can get this model from ../../public/i3d-rgb-tf/README.md.

Inputs

The network takes two kinds of input: the feature vectors for each video frame, which should be the output of a feature extraction network such as i3d-rgb-tf or resnet-50-tf, and the history feature outputs of the previous frame.

You can check the usage of i3d-rgb and smartlab-sequence-modelling-0001 in demos/smartlab_demo.

  1. Input feature, name: input, shape: 1, 2048, 24, format: B, W, H, where:

    • B - batch size

    • W - feature map width

    • H - feature map height

  2. History feature 1, name: fhis_in_0, shape: 12, 64, 2048, format: C, H', W

  3. History feature 2, name: fhis_in_1, shape: 11, 64, 2048, format: C, H', W

  4. History feature 3, name: fhis_in_2, shape: 11, 64, 2048, format: C, H', W

  5. History feature 4, name: fhis_in_3, shape: 11, 64, 2048, format: C, H', W, where:

    • C - number of channels in the feature vector

    • H' - feature map height

    • W - feature map width
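
The names and shapes listed above can be captured in a small lookup table and used to sanity-check a feed before inference. The following is a minimal sketch in plain Python; `EXPECTED_INPUTS`, `find_feed_problems` and `element_count` are hypothetical helper names introduced here for illustration and are not part of the model or of any runtime API.

```python
from math import prod
from typing import Dict, List, Mapping, Sequence, Tuple

# Input names and shapes copied from the list above.
EXPECTED_INPUTS: Dict[str, Tuple[int, ...]] = {
    "input": (1, 2048, 24),
    "fhis_in_0": (12, 64, 2048),
    "fhis_in_1": (11, 64, 2048),
    "fhis_in_2": (11, 64, 2048),
    "fhis_in_3": (11, 64, 2048),
}


def find_feed_problems(shapes: Mapping[str, Sequence[int]]) -> List[str]:
    """Return human-readable problems; an empty list means the feed matches."""
    problems = []
    for name, expected in EXPECTED_INPUTS.items():
        if name not in shapes:
            problems.append(f"missing input: {name}")
        elif tuple(shapes[name]) != expected:
            problems.append(
                f"{name}: expected {expected}, got {tuple(shapes[name])}"
            )
    for name in shapes:
        if name not in EXPECTED_INPUTS:
            problems.append(f"unexpected input: {name}")
    return problems


def element_count(name: str) -> int:
    """Number of scalar values in the named input tensor."""
    return prod(EXPECTED_INPUTS[name])
```

For example, `find_feed_problems({"input": (1, 2048, 24)})` reports the four missing history inputs, which is the kind of mistake that is easy to make on the first frame of a stream.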

Outputs

The outputs also consist of two parts: the prediction and four history feature outputs. The prediction holds the action classification results. The four feature maps are the model's layer features from past frames.

  1. Prediction, name: output, shape: 4, 1, 64, 24, format: C, B, H, W, where:

    • C - number of channels in the feature vector

    • B - batch size

    • H - feature map height

    • W - feature map width

    After post-processing with the argmax() function, the prediction can be used to determine the action type of the current frame.

  2. History feature 1, name: fhis_out_0, shape: 12, 64, 2048, format: C, H, W

  3. History feature 2, name: fhis_out_1, shape: 11, 64, 2048, format: C, H, W

  4. History feature 3, name: fhis_out_2, shape: 11, 64, 2048, format: C, H, W

  5. History feature 4, name: fhis_out_3, shape: 11, 64, 2048, format: C, H, W, where:

    • C - number of channels in the feature vector

    • H - feature map height

    • W - feature map width
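
Because the history outputs of one call become the history inputs of the next, streaming inference amounts to a loop that carries four tensors between frames. The sketch below illustrates that loop under stated assumptions: `infer` is a hypothetical callable standing in for whatever runtime executes the model (it takes a dict of named inputs and returns a dict of named outputs), and the initial history values are supplied by the caller, since this page does not say how they should be initialized. The `argmax` helper works on a flat sequence of per-class scores; which axis of `output` holds those scores is not specified here, so extracting them is left to the caller.

```python
from typing import Any, Callable, Dict, Iterable, List, Mapping, Sequence

NUM_HISTORY = 4  # fhis_in_0..3 and fhis_out_0..3, as listed above


def run_stream(
    infer: Callable[[Dict[str, Any]], Mapping[str, Any]],
    features: Iterable[Any],
    initial_history: Sequence[Any],
) -> List[Any]:
    """Run the model once per feature tensor, carrying history between calls."""
    if len(initial_history) != NUM_HISTORY:
        raise ValueError(f"expected {NUM_HISTORY} history tensors")
    history = list(initial_history)
    predictions = []
    for feature in features:
        # Combine the new feature with the history from the previous call.
        feed = {"input": feature}
        for i, tensor in enumerate(history):
            feed[f"fhis_in_{i}"] = tensor
        result = infer(feed)
        predictions.append(result["output"])
        # fhis_out_i of this call becomes fhis_in_i of the next call.
        history = [result[f"fhis_out_{i}"] for i in range(NUM_HISTORY)]
    return predictions


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

Keeping the runtime behind a plain callable makes the state handling easy to test with a stub before wiring in a real inference request.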